On a given project, data scientists can spend upwards of 80% of their time preparing, cleaning, and correcting data. In this session, we will look at different data cleansing and preparation techniques using both SQL Server and R. We will investigate the concept of tidy data and see how we can use tools in both languages to simplify research and analysis of a small but realistic data set.


The slides are available as a GitPitch slide deck.

You can also get a version of the slides in HTML5 format.

The slides are licensed under Creative Commons Attribution-ShareAlike.

Demo Code

The demonstration code is available on my GitHub repository. The repository includes all of the SQL and R code, the data sources used in the demos, and a notebook for tidyr.

The source code is licensed under the terms offered by the GPL.

Additional Media

On August 16, 2017, I gave a version of this talk at NDC Sydney. You can get the recording on the NDC YouTube channel.

Links And Further Information

Data Cleansing With SQL Server

Data Quality Services

Although I do not go into Data Quality Services in my talk, I consider it an important next step for promoting higher-quality data for analysis.

Data Cleansing With R