SATRDAY

The R community and some of South Africa's most forward-thinking companies have come together to bring satRday to Cape Town. This conference is an opportunity to hear from and network with top researchers, data scientists and developers from all over the country and the world.

Join the R community in registering for this exciting day dedicated to R.

Speakers

Keynote Speakers

Steph Locke

Data Scientist

CensorNet Ltd

Security Data Scientist during the day and by night a user group leader locally, conference organiser nationally, and presenter globally. Steph loves to help folk share knowledge.

Jennifer Bryan

Developer & Professor

RStudio / UBC

Jennifer Bryan is a recovering statistician who enjoys making it easier to do data analysis with R. She's working at RStudio on open source R packages and can often be found in the tidyverse. Jenny is on leave from the University of British Columbia, where she is Associate Professor of Statistics and the Academic Director of the Master of Data Science program. She shares teaching material at STAT545.com, serves in the leadership of rOpenSci, and is a member of the R Foundation.

Julia Silge

Data Scientist

Stack Overflow

Julia is a Data Scientist at Stack Overflow, where her work involves analyzing and modeling complex data sets and communicating about technical topics with diverse audiences. She has a PhD in Astrophysics, as well as abiding affections for Jane Austen and making beautiful charts. Julia worked in academia and ed tech before moving into data science and discovering R.

Speakers

Raymond Ellis

Machine Learning Specialist

Freelance

Gregory Streatfield

Head of Data and Analytics

Ogilvy & Mather

David Lubinsky

Managing Director

OPSI Systems

Marko Jakovljevic

Founder / CEO

Overscore

Jasen Mackie

Product Owner

Iress

Glenn Moncrieff

Data Scientist

Ixio Analytics

Marc van Heerden

Analytics Manager

UBER

Nkululeko Thangelane

Data Scientist

Standard Bank

Kael Huerta

Data Scientist

Toptal

Warren Allworth

Data Scientist

Aculocity

Anne Treasure

Postdoc

University of Cape Town

Katrin Tirok

Scientist

University of KwaZulu-Natal

Andrea Ross-Gillespie

Postdoc

University of Cape Town

Katie Lennard

Postdoctoral Fellow

University of Cape Town

Kirsty Lee Garson

PhD Candidate

University of Cape Town

Jeroen van der Merwe

Freelance Researcher

Independent

Peter Kamerman

Associate Professor

University of the Witwatersrand

Laing Lourens

Data Science Intern

Meraka Institute, CSIR

Kerryn Warren

PhD Student

University of Cape Town

Niels Berglund

Software Specialist

Derivco

Michael Johnson

BI Consultant

Independent

Robert Bennetto

Head of Consulting

Pivot Sciences

Hanjo Odendaal

Data Scientist

Eighty20 Analytics

Jacques Booysen

Data Scientist

EOH

Schalk Heunis

Data Scientist

EOH ESA

Matt Adendorff

Lead Technologist

Open Data Durban

Workshops

The satRday Cape Town conference will kick off on 16 February 2017* with two days of workshops held by our three international Keynote Speakers.

(*) Workshops available to training pass holders only.

Shiny

Julia Silge

Shiny is a framework for building interactive web apps and dashboards using the statistical programming language R. In this workshop, we will walk through how to move from an analysis performed in R, with plots and tables that an analyst wants to share, to a self-contained web app. We will learn:

what reactive programming is,
how to use the flexdashboard package to customize your app, and
how to publish and share your app.

Attendees should bring a laptop with the latest versions of R and RStudio installed, and we will build a Shiny app from scratch together.
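
To give a flavour of the reactive programming idea mentioned above, here is a minimal sketch of a Shiny app. It is not taken from the workshop material: the input name `n`, the output name `hist` and the simulated data are assumptions chosen purely for illustration, and it uses plain shiny rather than flexdashboard.

```r
library(shiny)

# UI: one slider input and one plot output
ui <- fluidPage(
  sliderInput("n", "Number of random values", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: `samples` is a reactive expression that is re-evaluated
# whenever input$n changes; the plot depends on it in turn
server <- function(input, output) {
  samples <- reactive({
    rnorm(input$n)
  })
  output$hist <- renderPlot({
    hist(samples(), main = paste("Histogram of", input$n, "random values"))
  })
}

# Build the app object; call runApp(app) in an interactive session to launch it
app <- shinyApp(ui = ui, server = server)
```

Moving the slider invalidates `samples`, which in turn causes the histogram to be redrawn; no explicit event handling code is needed.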

Git with R

Jennifer Bryan

Using Git and GitHub with R, RStudio and R Markdown.

Data analysts can use the Git version control system to manage a motley assortment of project files in a sane way (e.g. data, code, reports, etc.). This has benefits for the solo analyst and, especially, for anyone who wants to communicate and collaborate with others. Git helps you organize your project over time and across different people and computers. Hosting services like GitHub, Bitbucket and GitLab provide a home for your Git-based projects on the internet.

What's special about using R and Git(Hub)?

the active R package development community on GitHub
workflows for R scripts and R Markdown files that make it easy to share source and rendered results on GitHub
Git and GitHub-related features of the RStudio IDE

The tutorial will be structured as ~5 task-oriented units. Indicative topics:

The most difficult part: installation and configuration!
Creating a Git repository and connecting the local repo to a GitHub remote, for new and existing projects.
The intersection of GitHub and the R world: R packages developed on GitHub and how to make use of issues; METACRAN, a read-only mirror of all CRAN R packages; R-specific search tips.
Daily workflows and FAQ: How often should I commit? Which files should I commit? How do I change a commit or its message? How do groups of 1, 5, or 10 people structure their work with Git(Hub)? etc.

This will be a hands-on tutorial, so bring your prepared laptop and pre-register a free GitHub account.

Building and Validating Logistic Regression Models

Steph Locke

This workshop will take you through building a robust and reproducible logistic regression model from end-to-end. Starting from data ingestion to a fully documented model, we'll be covering:

The fundamentals: what logistic regression models are and what they're good for.
Data preparation techniques including univariate analysis, normalisation, binning, and feature engineering.
Sampling techniques including simple methods and bootstrapping.
Model generation including automated feature selection methods, and addressing performance issues.
Model validation including common model measures, out of time samples, and analysis of groups.

All of this will be done within a reproducible framework using R and markdown. It is intended for people who would like to deepen their understanding of building models and be able to reproduce their results, especially for academic or regulatory reasons.

This is a hands-on workshop with an emphasis on the practical implementation of techniques so you should be relatively comfortable coding with R and have used rmarkdown or knitr before. You should also bring a laptop with the latest versions of R and RStudio installed.
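
As a rough illustration of the fit-then-validate workflow outlined above, the following sketch fits a logistic regression with base R and checks it on held-out rows. It is not the workshop's own code: the built-in `mtcars` data, the `am ~ mpg` formula, the 70/30 split and the 0.5 cut-off are all assumptions made for the sake of a small, self-contained example.

```r
# Illustrative only: built-in mtcars data, not the workshop's data set
set.seed(42)  # fixed seed so the random split is reproducible

# Simple random sampling into training (70%) and test (30%) sets
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Model the probability of a manual transmission (am = 1) from fuel efficiency
model <- glm(am ~ mpg, data = train, family = binomial(link = "logit"))

# Validate on the held-out rows: predicted probabilities, classes, accuracy
test_prob <- predict(model, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)
accuracy <- mean(test_pred == test$am)
confusion <- table(predicted = test_pred, actual = test$am)
```

Calling `summary(model)` shows the fitted coefficients, and printing `confusion` gives a quick view of where the model misclassifies. With a data set this small the numbers are only indicative.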

Programme

Workshops Programme

Start	End	Thursday 16 February 2017
8:00	8:30	Registration / Coffee
8:30	10:00	Git with R (Jennifer Bryan): Part I. Resources: Happy Git and GitHub for the useR; https://starlogs.net/
10:00	10:30	Coffee
10:30	12:00	Git with R (Jennifer Bryan): Part II
12:00	13:00	Lunch
13:00	15:00	Building and Validating Logistic Regression Models (Steph Locke): Part I. Resources: ReproducibleGLM; ReproducibleGLM - step 0; ReproducibleGLM - step 1; ReproducibleGLM - step 2; ReproducibleGLM - step 3; ReproducibleGLM - step 4
15:00	15:30	Coffee
15:30	17:00	Building and Validating Logistic Regression Models (Steph Locke): Part II

Start	End	Friday 17 February 2017
8:30	10:00	Building and Validating Logistic Regression Models (Steph Locke): Part III
10:00	10:30	Coffee
10:30	12:00	Shiny (Julia Silge): Part I. Resources: An Introduction to Shiny; southafricastats: Population and Mortality Statistics for South Africa
12:00	13:00	Lunch
13:00	15:00	Shiny (Julia Silge): Part II
15:00	15:30	Coffee
15:30	17:00	Shiny (Julia Silge): Part III

Conference Programme

Tutorials are 1 hour long; standard talks are 20 minutes and lightning talks are a mere 5 minutes.

Start	End	Saturday 18 February 2017
7:30	8:00	Early registration for Tutorials
8:00	9:00	Tutorials
		Image processing and Tensorflow in R (Raymond Ellis and Greg Streatfield): The Nature Conservancy works to preserve marine ecosystems for the future. In November 2016, they asked data scientists to compete at building deep learning algorithms that can detect and classify species of fish from still images. We set out to enter their competition using R. In this session, we explore how to preprocess and prepare your data for deep learning. We then demonstrate how to use TensorFlow in R to build a solution to the challenge.
		Redis + R = Multi-user R apps (David Lubinsky): There are many options for data persistence from R, from SQL Server to Mongo, but one option that is fast, powerful, rich and very well suited to R programming is Redis. The combination of data structures like queues, ordered lists and hash sets with a light in-memory footprint makes Redis the ideal choice for apps that have a high transaction rate and many users. In this tutorial, we will show how easy it is to build R applications with Redis and, in particular, how Shiny apps can share back-end data through a Redis interface.
		Shiny / R and Devops (Marko Jakovljevic): The session opens with background on a DevOps approach to data science projects, showcasing how to use and orchestrate Docker containers for Shiny Server or Flask+Bokeh+Pandas (either the Python or the R stack). The focus then moves to securing the open source version of Shiny Server with various approaches, including an OAuth implementation such as Auth0.com or custom authentication (Nginx or Apache stack). I can also share quick and easy Dockerfiles / containers for deep learning and data science (SciPy etc.), which I have built over time to include all the required libraries and code for our dev teams.
8:30	9:30	Registration / Coffee
9:30	9:40	Welcome & Opening
9:40	10:30	Keynote #1: Text mining, the tidy way (Julia Silge) Beulah Snyman
10:30	11:45	Talks (Business Applications of R) Jon Calder
		Quantitative Strategy Development & Evaluation in R (Jasen Mackie): R/quantstrat is an R package built by leaders in the R/Finance community, allowing users to efficiently run backtests of trading strategies over multi-asset portfolios. I will run through a basic implementation of the package and illustrate the tools available for evaluating backtest results, including the R/blotter package. For a basic idea of what some of the content will look like, you can view Brian Peterson's presentation to the CapeRUser Group, which I arranged in July this year. Slides are available here. Depending on the progress I make with the txnsim() R/blotter function by then, I may include it in my talk to evaluate the performance of the strategy relative to its randomized equivalent, with the objective of assessing skill versus luck or overfitting.
		Visualizing the approach and exceedance of thresholds (Glenn Moncrieff): Visualizing thresholds is useful for understanding how they are approached and exceeded, and for assisting in planning corrective action. Borrowing from applications in the environmental sciences that have been used to highlight threats to global ecological health, we show how the same technique can be applied to visualize the performance of currency traders. Six trading rules are used to define thresholds for acceptable trader behaviour. Trading data are analyzed to produce reports aimed at encouraging poorly performing traders to undertake corrective behaviour. Radar plots for the visualization of thresholds are constructed using ggplot2, and the generation of reports with Rmarkdown is automated. A Shiny interface allows traders to easily view and download their personal report. This approach for comparing behaviour to predefined acceptable thresholds is applicable to a wider array of problems beyond trading behaviour or environmental sustainability.
		R at UBER (Marc van Heerden): An overview of the use cases and workflow employed when using R at UBER. Internal packages are discussed, as well as the manner in which R is used in conjunction with other systems to perform ad hoc and scheduled tasks.
		Rapid Data Science Application Deployment with R Shiny (Nkululeko Thangelane): This talk covers the advantages of using Shiny to deploy data science applications, showing the flexibility that Shiny brings to building both complex and simple applications for power users and simple users. A call volume forecasting application will be showcased to demonstrate how a powerful app used by business was developed in R Shiny.
		Who goes there? Profiling street by street audience with Telco data (Kael Huerta): This talk shows how to monetize Telco data while respecting users' privacy, with a street-by-street profile of the people passing by. Such a profile includes predicted age group, gender, and interests based upon app usage, web history and passive location (only when the cellphone is used). All implemented in R.
		Inventory Forecasting using R (Warren Allworth & Peter Gross): Our client required an analytical solution that would assist in optimizing inventory levels to satisfy customer demand whilst reducing holding costs and overall footprint in the warehouse. The project focused on using time series analysis for products that exhibited seasonal patterns, as well as stochastic modelling for non-seasonal parts (random demand). The output of the forecast as well as metadata were then used to determine inventory management thresholds incorporating supplier lead times and order cycles.
11:45	12:45	Lunch
12:45	13:35	Keynote #2: Data Rectangling (Jenny Bryan) Alice Coyne
13:35	14:40	Talks (R in the Sciences) Andrew Collier
		Using R for Oceanography (Anne Treasure and Katrin Tirok): Autonomous oceanographic sampling devices such as gliders, Argo floats and animal-borne instruments have become a major component of the ocean observing system and have proven invaluable to the ocean science community. Several steps are involved to get from the raw data to scientific products. While R is not well known or traditionally used in physical oceanography, there are packages and tools available that are well suited to this field. Seawater characteristics can be calculated from temperature and conductivity using available ocean science R packages such as ‘oce’ or ‘gsw’. ‘oce’ can also be used to plot data density distribution maps. For subsequent analyses, statistical and geostatistical packages for R are helpful, e.g. kriging of variables for spatial interpolation using ‘gstat’. Packages such as ‘ggplot2’ and ‘ggmap’ are useful for the visualisation of data in the form of contour plots and current velocities. Here, we will showcase these uses of R for oceanography by highlighting results from two research studies where data from autonomous oceanographic sampling devices have been used. The successful use of autonomous devices can depend on the study region of interest. Therefore, we will first show results of a spatial and temporal comparison of data from Argo floats and animal-borne instruments in the Southern Ocean, along with some functionality of the package ‘oce’. Second, a more in-depth look at subsequent analysis of data will be shown, using data from gliders in the Agulhas current off Northern KwaZulu-Natal.
		Impact of fishing on African penguin population off the West Coast (Andrea Ross-Gillespie): The African Penguin population has been declining steadily, likely as a result of dramatic changes in the anchovy and sardine abundances, as these species form the primary component of the penguins' diets. In 2008, an experiment was initiated whereby an area of a 10nm radius around each of four selected breeding islands (Dassen and Robben islands on the West Coast, and Bird and St Croix on the South Coast) was closed to the fishery. The aim of the experiment is to see whether the fishing activity around the breeding island has a negative impact on the penguin population, or conversely whether closing the island to the fishery benefits the penguins in a meaningful way. Analysis of the collected penguin population data, however, did not yield conclusive results, leading to the question of whether or not the experiment should continue. The challenge is to balance (on the one hand) the risk of concluding that there is no meaningful impact on the penguin population when in fact there is an impact that has just not been detected yet in the data, and (on the other hand) continuing the closure experiment when there is in fact no meaningful beneficial impact on the penguin population, at a great cost to the fishing industry. To address this challenge, we embarked on a form of power analysis, which aims to answer the following question: If there is a biologically meaningful impact of the fishery on the penguin population, how long would the island closure experiment need to continue before we are likely to detect this impact? This question, and the way to address it, has been the subject of discussion at the International Stock Assessment Workshop held at UCT over the last two years. The work was concluded in December 2016, with the final recommendations by the panel for the workshop leading to the analyses presented at a recent working group meeting of the Fisheries Branch of the Department of Agriculture, Forestry and Fisheries to inform the decision on the future of the penguin island closure experiment.
		MicRobiome Analysis (Katie Lennard and Kirsty Lee Garson): More than half of the cells which make up our bodies are bacterial. The human body is home to a diverse array of bacteria and other microorganisms, collectively known as the human microbiome. Alterations in the microbiome have been linked to a wide variety of diseases including cancer, asthma, obesity, and depression. Over the last decade, advances in DNA sequencing technology have facilitated rapid progress in microbiome research, which has been met with the equally rapid development of data analysis methods - many of which are implemented in R. Here we briefly introduce R-based microbiome analysis, using the role of the microbiome in HIV susceptibility as an example. Creating graphics in R for high dimensional data serves as a first step in data exploration. As an example, we will demonstrate the use of annotated heatmaps, a versatile tool with a wide range of applications.
		Visual modelling with pavo (Jeroen van der Merwe): Using the package pavo we created visual models to obtain distances in visual space between populations of restio leaf hoppers. Shorter distances in visual space indicate closer matches in colour. This formed part of a study to see whether local adaptation of leaf hoppers to local host plants has occurred.
		Helping to ease the pain with R/RMarkdown (Peter Kamerman): We recently submitted an application to the World Health Organisation for the inclusion of a medicine (gabapentin) used to manage pain caused by nerve damage on its 'List of Essential Medicines'. Wanting to make the process of generating the report as transparent and reproducible as possible, we decided to give R/RMarkdown a try. Using these tools, sprinkled with bits of LaTeX that we had to learn on the fly, we generated the full application and a supplementary online storyboard, and made all the code and data available online. I will present some of the cool data on the burden of chronic pain (especially neuropathic pain) and analgesic medicine availability that we pulled together from various sources, and how we used R and some of the many static and interactive plotting tools available in the R ecosystem to analyse and visualise these data in a meaningful and (hopefully) compelling manner.
		Prospects for Trachoma Elimination through targeted mass treatment (Laing Lourens): Trachoma is the leading infectious cause of blindness worldwide. Nearly 20 years ago, the World Health Organisation (WHO) issued the goal of eliminating trachoma-induced blindness by the year 2020. Previous evidence has shown that targeted treatment of children less than 10 years of age is able to reduce prevalence across an entire community. A Markov model of trachoma transmission that assumes two age classes is presented, with parameters estimated using an accept/reject procedure fitted to data from a previous clinical trial. Based on the best fitting parameter sets, mass drug administration at different coverage and periodicity is simulated to assess its impact on population level prevalence.
		Using R to understand Human Evolution (Kerryn Warren): The hominin (human) fossil record is scant, and palaeoanthropologists frequently rely on a variety of specialised programmes to make sense of our own evolution. R, however, has allowed for the integration of a variety of techniques, such as collecting data on 3D scans, analysing morphological and GIS data, and producing effective figures for interpretation of trends. The trend towards using R for analysing fossil records has allowed for greater collation of techniques (internationally) and more openness with results.
14:40	15:10	Coffee
15:10	16:00	Keynote #3: The you in community (Stephanie Locke) Katie Lennard
16:00	17:20	Talks (General) Etienne Koen
		Microsoft, Open Source, R: You gotta be kidding! (Niels Berglund): In this talk we will have a look at Microsoft R Server, which is a High Performance Computing and High Performance Analytics R implementation. The talk is highly code-driven, and we will do comparisons between CRAN R and the Microsoft implementation.
		R and PowerBI: best frenemies (Michael Johnson): For many years, there has been a rivalry between Microsoft and the open source community, but that is changing. PowerBI is Microsoft’s new data tool that allows data analysts to create rich interactive reports with support for data preparation, modeling, and visualization, and it now includes R integration. In this session, we'll look at how PowerBI and R integrate using R scripts, allowing the data analyst to leverage the strengths of each tool.
		High performance R: integrating cpp into your workflow (Robert Bennetto): The talk will provide motivations for integrating cpp into your workflow as a data scientist - the result of which can dramatically improve the overall performance of your R code. A brief discussion of the R interpreter and a comparison to compiled languages will be provided, with examples to substantiate the motivation. The types of computational tasks that lend themselves to a cpp approach will be discussed. An overview of the cpp primitives available out of the box is provided.
		Using visNetworks to Visualize Vine Copula (Hanjo Odendaal): Currently the visual illustration of Vine copulae from the VineCopula package offers a bland plotting output. Combining the visNetwork html-widget with the VineCopula RVM output offers an interactive and visually appealing way to understand and interpret your output.
		R as a GIS: Introduction to spatial data visualization and manipulation (Jacques Booysen): This talk is an introduction to using spatial data in R, giving an overview as well as practical application of how spatial data can be created, manipulated and visualised using the R platform. It assumes no prior knowledge of spatial data analysis, but prior understanding of the R command line would be beneficial. Practical applications will include: using the Google Elevation API with Google Maps Visualization using R, Spatial Interpolation/Modeling of Temperature Data in R, and GIS climate change data on Amazon S3 using R.
		Energy planning for climate change using R, StarCluster and Shiny (Schalk Heunis): StarCluster is an open source grid computing framework that was used with R and Shiny to produce a geospatial visual interactive dashboard to inform energy planners about the long term impact of climate change on energy supply in Africa. StarCluster runs on Amazon EC2 and allowed horizontal scaling of MIP optimization. We were able to produce optimal energy solutions over thousands of future climate/socio-economic scenarios and decisions. These solutions are then presented to decision makers, who can interact with the dashboard to build intuition, understand risks and discover opportunities. This talk is about using horizontal scaling to deal with uncertainty in decision making using R, Shiny and StarCluster on Amazon EC2.
		Open Data for Better Decision Making (Matthew Adendorff): We live in the Data Age and now have the ability to ingest huge quantities of information to power predictions and insights. A caveat to this new-found computational capacity is that the adage "Garbage in, garbage out" is still as relevant as ever, if not more so. In addition, the distillation of meaningful information from the Internet's data fire-hose takes careful processing and reduction, and a well-manicured dataset is worth its weight in gold. Fortunately, in tandem with this current rise in access to information, the open data movement is rapidly approaching maturation, and governments / civic society are now providing powerful tools and insight-rich data repositories free-of-charge to whoever wants to utilize them. The inclusion of such sources and technologies in modern Big Data infrastructure can provide a powerful platform for impactful analyses. In this talk I will present the potential that these open data sources present for modern computational pipelines, and will discuss some successful applications of this paradigm to South African challenges.
17:20	17:30	Break
17:30	18:00	Data Visualisation Challenge Ryan Nel

Venue

Nestled in the heart of Cape Town's V&A Waterfront, Workshop 17 is a bustling hub for start-ups and entrepreneurs alike.

Travel

Close to the City Center, Workshop 17 is within easy reach of transport and accommodation.

Registration

Early Bird tickets now available! Offer ending on 2016-12-23*.

(*) No late or at-the-door registration available. Early Bird pricing ends on 2016-12-23. The standard price for Day Passes will be R200.00 and for Training Passes R1500.00.

Training Pass

The Training Pass gives you access to all three workshops held by our Keynote Speakers. These workshops will be run one after the other on 16th and 17th February 2017.

All tickets for the workshops are sold out!

Tickets for the satRday conference on 18 February are still available here.

	Training Pass (Early Bird R1000.00)
2 Day Workshop
Lunch
Networking breaks with refreshments provided

Day Pass

The Day Pass gives you access to South Africa's first international R conference and the world's second satRday. Come and join us on 18th February for this day dedicated to R.

	Day Pass (Early Bird R100.00)
Morning Tutorials
All Conference Talks
Lunch
Two networking breaks with refreshments provided

Important Dates

Working up to the conference on 18 February 2017, these are the most important dates on your calendar:

Event	Date
Early-Bird Registration Deadline	2016-12-23
Tutorial Submission Deadline	2017-01-20
Talk Submission Deadline	2017-01-20
Official Notification of Submission Acceptance	2017-01-25
Registration Deadline	2017-02-06
Visualisation Challenge Deadline	2017-02-14
Workshops	2017-02-16 / 2017-02-17
Conference	2017-02-18

Data Visualisation Challenge

Vantage Data and Tableau challenge you to bring data to life.

The Task

Develop a visualisation in R which exposes a captivating story hidden in the 2012 Causes of Death data.

The Prize

A Tableau Desktop Professional License (valued at R27 000.00).

The Sponsors

Vantage shows companies and individuals how data analytics goes beyond insight to uncover hidden value and patterns which can drive immense value. They partner with Tableau whose software helps people see and understand their data. Find out how they do it at www.vantagedata.co.za or discover Tableau for yourself at www.tableau.com.

The Data

The data for the Visualisation Challenge are the 2012 Causes of Death data gathered by Statistics South Africa and made available by Code for South Africa at https://bit.ly/2k7YP0I. Extensive metadata are also available. The data set contains 495 thousand deaths and provides a fertile source for a range of visualisations.

Guidelines for Participation

Only registered conference attendees are eligible.
R must be used to load, process and visualise the data. Any R package may be used.
Submissions can take the form of one or more plots or images combined as a single visualisation, dashboard or Shiny application.
Source code used to generate the submission must be released with an Open Source license.
Submit the following:
- a static copy of the visualisation (and a link to live copy if available);
- a short (maximum 1000 words) description of the visualisation;
- a link to the source code.

Submissions should be sent to satrday.cape.town+visualisation@gmail.com before midnight SAST 14 February 2017.

Visualisations will be presented and the winner will be announced at the end of satRday on 18 February 2017.