Early registration for Tutorials
- Image processing and TensorFlow in R (Raymond Ellis and Greg Streatfield)
The Nature Conservancy works to preserve marine ecosystems for the future. In November 2016, they asked data scientists to compete at building deep learning algorithms that can detect and classify species of fish from still images. We set out to compete in their competition, using R.
In this session, we explore how to preprocess and prepare your data for deep learning. We then demonstrate how to use TensorFlow in R to build a solution to the challenge.
- Redis + R = Multi-user R apps (David Lubinsky)
There are many options for data persistence from R, from SQL Server to Mongo, but one option that is fast, powerful, rich and very well suited to R programming is Redis. The combination of data structures like queues, ordered lists and hash sets with a light in-memory footprint makes Redis the ideal choice for apps that have a high transaction rate and many users. In this tutorial, we will show how easy it is to build R applications with Redis and, in particular, how Shiny apps can share back-end data through a Redis interface.
- Shiny / R and DevOps (Marko Jakovljevic)
The talk opens with background on a DevOps approach to data science projects, showcasing how to use and orchestrate the Docker container service for Shiny Server or Flask + Bokeh + Pandas (either a Python or an R stack).
Further focus will be on securing the open source version of Shiny Server with various approaches, including an OAuth implementation such as Auth0.com or custom authentication (Nginx or Apache stack).
I can also share quick and easy Dockerfiles and containers for deep learning and data science (SciPy etc.), which I have built over time to include all the required libraries and code for our dev teams.
Registration / Coffee
Welcome & Opening
Keynote #1: Text mining, the tidy way (Julia Silge)
Talks (Business Applications of R)
- Quantitative Strategy Development & Evaluation in R (Jasen Mackie)
R/quantstrat is an R package built by leaders in the R/Finance community allowing users to efficiently run backtests of trading strategies over multi-asset portfolios. I will run through a basic implementation of the package, and illustrate the tools available for evaluating backtest results, including the R/blotter package.
For a basic idea of what some of the content will look like you can view Brian Peterson's presentation to the CapeRUser Group which I arranged in July this year. Slides are available here.
Depending on the progress I make with the txnsim() R/blotter function by then, I may include it with my talk on evaluating the relative performance of the strategy versus its randomized equivalent with the objective being to assess skill versus luck or overfitting.
- Visualizing the approach and exceedance of thresholds (Glenn Moncrieff)
Visualizing thresholds is useful for understanding how they are approached and exceeded, and assisting in planning corrective action. Borrowing from applications in the environmental sciences that have been used to highlight threats to global ecological health, we show how the same technique can be applied to visualize the performance of currency traders. Six trading rules are used to define thresholds for acceptable trader behaviour. Trading data are analyzed to produce reports aimed at encouraging poorly performing traders to undertake corrective behaviour. Radar plots for the visualization of thresholds are constructed using ggplot2 and the generation of reports with Rmarkdown is automated. A Shiny interface allows traders to easily view and download their personal report. This approach for comparing behaviour to predefined acceptable thresholds is applicable to a wider array of problems beyond trading behaviour or environmental sustainability.
- R at UBER (Marc van Heerden)
An overview of the use cases and workflow employed when using R at UBER. Internal packages are discussed as well as the manner in which R is used in conjunction with other systems to perform ad hoc and scheduled tasks.
- Rapid Data Science Application Deployment with R Shiny (Nkululeko Thangelane)
This talk covers the advantages of using Shiny to deploy data science applications, showing the flexibility that Shiny brings to building both complex and simple applications for power users and simple users alike. A call volume forecasting application will be showcased to demonstrate how a powerful app used by the business was developed in R Shiny.
- Who goes there? Profiling street by street audience with Telco data (Kael Huerta)
This talk shows how to monetize Telco data while respecting users' privacy, with a street-by-street profile of the people passing by. Such a profile includes predicted age group, gender, and interests based upon app usage, web history and passive location (only when the cellphone is used). All of it is implemented in R.
- Inventory Forecasting using R (Warren Allworth & Peter Gross)
Our client required an analytical solution that would assist in optimizing the inventory levels to satisfy customer demand whilst reducing holding costs and overall footprint in the warehouse. The project focused on using time series analysis for products that exhibited seasonal patterns as well as stochastic modelling for non-seasonal parts (random demand).
The output of the forecast as well as metadata were then used to determine inventory management thresholds incorporating supplier lead times and order cycles.
Keynote #2: Data Rectangling (Jenny Bryan)
Talks (R in the Sciences)
- Using R for Oceanography (Anne Treasure and Katrin Tirok)
Autonomous oceanographic sampling devices such as gliders, Argo floats and animal-borne instruments have become a major component of the ocean observing system and have proven invaluable to the ocean science community. Several steps are involved to get from the raw data to scientific products. While R is not well known or traditionally used in physical oceanography, there are packages and tools available that are well suited to this field. Seawater characteristics can be calculated from temperature and conductivity using available ocean science R packages such as ‘oce’ or ‘gsw’. ‘oce’ can also be used to plot data density distribution maps. For subsequent analyses, statistical and geostatistical packages for R are helpful, e.g. kriging of variables for spatial interpolation using ‘gstat’. Packages such as ‘ggplot2’ and ‘ggmap’ are useful for the visualisation of data in the form of contour plots and current velocities.
Here, we will showcase these uses of R for oceanography by highlighting results from two research studies where data from autonomous oceanographic sampling devices have been used. The successful use of autonomous devices can depend on the study region of interest. Therefore, we will first show results of a spatial and temporal comparison of data from Argo floats and animal-borne instruments in the Southern Ocean, along with some functionality of the package ‘oce’. Second, a more in-depth look at subsequent analysis of data will be shown, using data from gliders in the Agulhas Current off northern KwaZulu-Natal.
- Impact of fishing on African penguin population off the West Coast (Andrea Ross-Gillespie)
The African Penguin population has been declining steadily, likely as a result of dramatic changes in the anchovy and sardine abundances, as these species form the primary component of the penguins' diets. In 2008, an experiment was initiated whereby an area of a 10nm radius around each of four selected breeding islands (Dassen and Robben islands on the West Coast, and Bird and St Croix on the South Coast) was closed to the fishery. The aim of the experiment is to see whether the fishing activity around the breeding island has a negative impact on the penguin population, or conversely whether closing the island to the fishery benefits the penguins in a meaningful way.
Analysis of the collected penguin population data, however, did not yield conclusive results, leading to the question of whether or not the experiment should continue. The challenge is to balance (on the one hand) the risk of concluding that there is no meaningful impact on the penguin population when in fact there is an impact that has just not been detected yet in the data, and (on the other hand) continuing the closure experiment when there is in fact no meaningful beneficial impact on the penguin population, at a great cost to the fishing industry. To address this challenge, we embarked on a form of power analysis, which aims to answer the following question: If there is a biologically meaningful impact of the fishery on the penguin population, how long would the island closure experiment need to continue before we are likely to detect this impact?
This question, and the way to address it, has been a topic of discussion at the International Stock Assessment Workshop held at UCT over the last two years. The work was concluded in December 2016, with the panel's final recommendations from the workshop leading to the analyses presented at a recent working group meeting of the Fisheries Branch of the Department of Agriculture, Forestry and Fisheries to inform the decision on the future of the penguin island closure experiment.
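The power-analysis question above can be illustrated with a small simulation. This is only a minimal sketch in Python, not the analysis presented in the talk: the function names, the effect size, the noise level and the crude detection rule (mean more than two standard errors above zero) are all assumptions chosen for illustration.

```python
import random
import statistics


def detection_probability(n_years, effect, noise_sd, n_sims=2000, seed=1):
    """Estimate the chance of detecting a closure effect after n_years.

    Each simulated year gives one noisy observation of the difference in a
    penguin response (closed minus open). All values are hypothetical.
    """
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sims):
        diffs = [rng.gauss(effect, noise_sd) for _ in range(n_years)]
        mean = statistics.fmean(diffs)
        se = statistics.stdev(diffs) / n_years ** 0.5
        # Assumed crude one-sided rule: count mean / se > 2 as a detection.
        # A t-test with proper critical values would be more careful.
        if se > 0 and mean / se > 2.0:
            detected += 1
    return detected / n_sims


def years_needed(effect, noise_sd, target_power=0.8, max_years=50):
    """Smallest number of years whose detection probability meets the target."""
    for n_years in range(3, max_years + 1):
        if detection_probability(n_years, effect, noise_sd) >= target_power:
            return n_years
    return None  # target not reached within max_years


if __name__ == "__main__":
    # Hypothetical effect equal to half of the year-to-year noise.
    print(years_needed(effect=0.5, noise_sd=1.0))
```

The smaller the assumed effect relative to the year-to-year noise, the longer the simulated experiment has to run before a detection becomes likely, which is the trade-off the abstract describes.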
- MicRobiome Analysis (Katie Lennard and Kirsty Lee Garson)
More than half of the cells which make up our bodies are bacterial. The human body is home to a diverse array of bacteria and other microorganisms, collectively known as the human microbiome. Alterations in the microbiome have been linked to a wide variety of diseases including cancer, asthma, obesity, and depression.
Over the last decade, advances in DNA sequencing technology have facilitated rapid progress in microbiome research, which has been met with the equally rapid development of data analysis methods, many of which are implemented in R.
Here we briefly introduce R-based microbiome analysis, using the role of the microbiome in HIV susceptibility as an example. Creating graphics in R for high dimensional data serves as a first step in data exploration. As an example, we will demonstrate the use of annotated heatmaps, a versatile tool with a wide range of applications.
- Visual modelling with pavo (Jeroen van der Merwe)
Using the package pavo, we created visual models to obtain distances in visual space between populations of restio leafhoppers. Shorter distances in visual space indicate closer matches in colour. This formed part of a study to see whether local adaptation of leafhoppers to local host plants has occurred.
- Helping to ease the pain with R/RMarkdown (Peter Kamerman)
We recently submitted an application to the World Health Organisation for the inclusion of a medicine (gabapentin) used to manage pain caused by nerve damage on its 'List of Essential Medicines'. Wanting to make the process of generating the report as transparent and reproducible as possible, we decided to give R/RMarkdown a try. Using these tools, sprinkled with bits of LaTeX that we had to learn on the fly, we generated the full application and a supplementary online storyboard, and made all the code and data available online. I will present some of the cool data on the burden of chronic pain (especially neuropathic pain) and analgesic medicine availability that we pulled together from various sources, and how we used R and some of the many static and interactive plotting tools available in the R ecosystem to analyse and visualise these data in a meaningful and (hopefully) compelling manner.
- Prospects for Trachoma Elimination through targeted mass treatment (Laing Lourens)
Trachoma is the leading infectious cause of blindness worldwide. Nearly 20 years ago, the World Health Organisation (WHO) set the goal of eliminating trachoma-induced blindness by the year 2020. Previous evidence has shown that targeted treatment of children under 10 years of age is able to reduce prevalence across an entire community.
A Markov model of trachoma transmission that assumes two age classes is presented, with parameters estimated using an accept/reject procedure fitted to data from a previous clinical trial. Based on the best-fitting parameter sets, mass drug administration at different coverage and periodicity is simulated to assess its impact on population-level prevalence.
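As a rough illustration of the accept/reject idea described above, the Python sketch below fits one transmission parameter of a toy two-age-class model and then simulates periodic treatment of children. It is not the presenter's model: the update rule, the equal mixing between classes, the starting prevalences, the recovery probabilities and the uniform prior on `beta` are all assumptions made for this example.

```python
import random


def simulate(beta, recovery=(0.05, 0.15), weeks=260, coverage=0.0, period=0):
    """Return final prevalence (children, adults) of a toy two-class model.

    Each week a susceptible person becomes infected with probability
    beta * force, and an infected person recovers with a class-specific
    probability. Every `period` weeks a fraction `coverage` of infected
    children is cleared by treatment (period=0 means no treatment).
    """
    child, adult = 0.2, 0.05  # hypothetical starting prevalences
    for week in range(1, weeks + 1):
        force = 0.5 * child + 0.5 * adult  # assumed equal mixing of classes
        child += beta * force * (1 - child) - recovery[0] * child
        adult += beta * force * (1 - adult) - recovery[1] * adult
        if period and week % period == 0:
            child *= 1 - coverage
    return child, adult


def fit_beta(observed_child, tolerance=0.02, n_draws=2000, seed=1):
    """Accept/reject: keep prior draws whose simulated child prevalence
    lies within `tolerance` of the observed value."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        beta = rng.uniform(0.0, 0.5)  # assumed uniform prior
        child, _ = simulate(beta)
        if abs(child - observed_child) <= tolerance:
            accepted.append(beta)
    return accepted


if __name__ == "__main__":
    observed, _ = simulate(0.2)  # stand-in for trial data
    accepted = fit_beta(observed)
    best = sum(accepted) / len(accepted)
    # Compare child prevalence with and without six-monthly treatment.
    print(simulate(best)[0], simulate(best, coverage=0.8, period=26)[0])
```

Varying `coverage` and `period` in the final call mirrors the kind of scenario comparison the abstract mentions, here with made-up numbers.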
- Using R to understand Human Evolution (Kerryn Warren)
The hominin (human) fossil record is scant, and palaeoanthropologists frequently rely on a variety of specialised programmes to make sense of our own evolution. R, however, has allowed for the integration of a variety of techniques, such as collecting data on 3D scans, analysing morphological and GIS data and producing effective figures for interpretation of trends.
The trend towards using R for analysing fossil records has allowed for greater collation of techniques (internationally) and more openness with results.
Keynote #3: The you in community (Stephanie Locke)
- Microsoft, Open Source, R: You gotta be kidding! (Niels Berglund)
In this talk we will have a look at Microsoft R Server, which is a High Performance Computing and High Performance Analytics R implementation. The talk is highly code-driven, and we will do comparisons between CRAN R and the Microsoft implementation.
- R and PowerBI: best frenemies (Michael Johnson)
For many years, there has been a rivalry between Microsoft and the open source community but that is changing. PowerBI is Microsoft’s new data tool that allows data analysts to create rich interactive reports with support for data preparation, modeling, and visualization and now includes R integration.
In this session, we'll look at how PowerBI and R integrate using R scripts allowing the data analyst to leverage the strengths of each tool.
- High performance R: integrating cpp into your workflow (Robert Bennetto)
The talk will provide motivations for integrating cpp into your workflow as a data scientist, which can dramatically improve the overall performance of your R code. A brief discussion of the R interpreter and a comparison to compiled languages will be provided, with examples to substantiate the motivation.
The types of computational tasks that lend themselves to a cpp approach will be discussed. An overview of the cpp primitives available out of the box is provided.
- Using visNetwork to Visualize Vine Copula (Hanjo Odendaal)
Currently, the visual illustration of Vine copulae from the VineCopula package offers a bland plotting output. Combining the visNetwork html-widget with the VineCopula RVM output offers an interactive and visually appealing way to understand and interpret your output.
- R as a GIS: Introduction to spatial data visualization and manipulation (Jacques Booysen)
This talk is an introduction to using spatial data in R, giving an overview as well as practical application of how spatial data can be created, manipulated and visualised using the R platform. It assumes no prior knowledge of spatial data analysis, but prior understanding of the R command line would be beneficial. Practical applications will include: using the Google Elevation API with Google Maps Visualization using R, Spatial Interpolation/Modeling of Temperature Data in R and GIS climate change data on Amazon S3 using R.
- Energy planning for climate change using R, StarCluster and Shiny (Schalk Heunis)
StarCluster is an open source grid computing framework that was used with R and Shiny to produce a geospatial visual interactive dashboard to inform energy planners about the long term impact of climate change on energy supply in Africa.
StarCluster runs on Amazon EC2 and allowed horizontal scaling of MIP optimization. We were able to produce optimal energy solutions over thousands of future climate/socio-economic scenarios and decisions. These solutions are then presented to decision makers, who can interact with the dashboard to build intuition, understand risks and discover opportunities.
This talk is about using horizontal scaling to deal with uncertainty in decision making using R, Shiny and StarCluster on Amazon EC2.
- Open Data for Better Decision Making (Matthew Adendorff)
We live in the Data Age and now have the ability to ingest huge quantities of information to power predictions and insights. A caveat to this new-found computational capacity is that the adage "garbage in, garbage out" is still as relevant as ever, if not more so. In addition, the distillation of meaningful information from the Internet's data fire-hose takes careful processing and reduction, and a well-manicured dataset is worth its weight in gold.
Fortunately, in tandem with this current rise in access to information, the open data movement is rapidly approaching maturation, and governments and civic society are now providing powerful tools and insight-rich data repositories free of charge to whoever wants to utilize them. The inclusion of such sources and technologies in modern Big Data infrastructure can provide a powerful platform for impactful analyses.
In this talk I will present the potential that these open data sources present for modern computational pipelines, and will discuss some successful applications of this paradigm to South African challenges.
Data Visualisation Challenge