Docker For Data Science: An Introduction

What did people use before Docker? I honestly do not know, and I do not think it is essential knowledge. Knowing what Docker is and how you can use it to take your code to the next level is, however, useful knowledge. This blog post is a beginner’s guide to Docker, although it is a more advanced topic as far as Data Science is concerned. I explain certain concepts, but not in detail, as I expect the reader to have a good working knowledge of the R programming language, notebooks, and version control. Where the reader needs more background on a concept, I have provided a few links to assist.

Motivation

When I started as a Data Scientist, my focus was always on finding the best algorithm by conducting experiments that would improve my model’s performance. I never really thought much about how I would share my findings and code with others, beyond “find the link to the GitHub repository below”. While this is somewhat OK, I experienced a lot of challenges myself when trying to reproduce other people’s work. Most of the difficulty was the environment setup: in many cases it was so time-consuming that I would give up and get on a call with the main developer.

Docker is a tool that simplifies environment creation (you can run the same code on multiple machines with ease).

Scope of the Tutorial

Data Scientists often communicate their findings with notebooks. Popular options include Jupyter and R Markdown. Notebooks allow you to integrate code and text, and you can create them in such a way that a non-technical audience can also interact with your work. To get more of an idea, you can check out my RPubs. I therefore thought it would be valuable to demonstrate how you can:

  • Add an R Markdown document to a Docker image
  • Run the Docker image and produce an output
  • Share the image on a public repository so that others can use it
  • Integrate GitHub and Docker Hub to automatically manage changes (anyone who pulls the image will always be using the latest version)

Note: this is the first post in the series “Docker for Data Science”. I plan to go into more detail on concepts that I only touch on briefly in this introduction. I have decided to save the theory for later because it is always more motivating when you can see some results first. I also understand that if you are not a Software Engineer, Docker sounds very intimidating. For this reason, I am creating this blog series to focus only on Docker for Data Science. This really makes the scope less complex, trust me. I believe Data Scientists need to know how to incorporate Docker into their workflow so that they can test the reproducibility of their solution locally before handing it over to DevOps.

Let’s get started …

Setup Docker and GitHub

Before you can start using Docker you need to:

  • Create a Docker Hub account here
  • Follow the instructions and install Docker Desktop for Mac, Windows or Linux here (note: I use Windows)

To verify that your installation was successful, run the following command in your Windows Command Prompt or Windows PowerShell ISE:

docker -v

You should get an output similar to:

Docker version 19.03.12, build 48a66213fe

For reproducibility, you always need to begin with version control in mind. Therefore, make sure you have GitHub Desktop installed, or Git integrated into your R workflow. GitHub is a code-sharing platform that simplifies the process of managing code updates. It is essential to have a working understanding of Git, so I have provided a list of resources to help you get more familiar with it:

  • GitHub Desktop is beginner friendly; you can find a good resource to help you understand the workflow here.
  • I prefer to have Git integrated with my RStudio. You can learn how to do that by going through Happy Git and GitHub for the useR.
  • I recommend starting by understanding the basic Git commands such as clone, pull, commit, and push before taking it further (see the sketch after this list). For those who would like more advanced knowledge of Git, a good resource can be accessed here.
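
For reference, here is a minimal sketch of that basic workflow on the command line (the repository URL, file name, and commit message are placeholders):

# Copy a remote repository to your computer
git clone <repository URL>
# Fetch and merge the latest changes from the remote
git pull
# Stage and record a local change
git add <file>
git commit -m "Describe your change"
# Upload your commits to the remote repository
git push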

Folder Structure

If your project uses various files such as scripts, markdowns, data, or even a Shiny app, it is best practice to organize these into folders. Good news: I have already done this for you. The folders that we will be using for this tutorial are available here in the docker-r-myproject repository. You will know you are in the right place if the repository looks like figure one below.

Figure one: Docker R project repository

You can clone the repository using GitHub Desktop (or whichever method suits you best) so that you can work on the project locally.

If you would like to learn more about folder structures I would recommend you get familiar with R projects. R projects provide a good guideline for keeping track of the various files and folders that are part of a project (scripts, data, figures) and organizing these in a manageable way.

In docker-r-myproject, you will notice the following items: markdown, data, R, Dockerfile, output, and README.md. A markdown integrates code and text, and is a way to communicate results to a non-technical audience. From the markdown (report.Rmd) we want to produce a report in HTML format.
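
Roughly, the repository is laid out as below (a sketch based on the items listed above; I am assuming report.Rmd sits inside the markdown folder):

docker-r-myproject/
├── markdown/      # report.Rmd
├── data/          # combined.csv
├── R/             # install_packages.R, script2run.R
├── output/        # the rendered HTML report lands here
├── Dockerfile
└── README.md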

The markdown uses data stored in the data folder (i.e. combined.csv) to produce plots. Below is a sample code snippet that reads the data and plots a histogram of the dependent variable fare_amount. The data is read using functions from the R package data.table, and the plot is made using ggplot2. These packages are dependencies, and dependencies are managed in a separate script. This script is stored in the folder called R and is named install_packages.R.

```{r message=FALSE, warning=FALSE, paged.print=TRUE, echo = FALSE}
# load the package dependencies
library(data.table)
library(ggplot2)

# read the data and drop the first column
dat <- fread("/data/combined.csv")
dat <- dat[, -1]

# dependent variable analysis: histogram of fare_amount
p1 <- ggplot(data = dat, aes(x = fare_amount)) +
  geom_histogram()
p1
```

The install_packages.R script must contain all the package dependencies needed to successfully run report.Rmd. All R scripts are stored in the R folder. You will notice that there is an additional script called script2run.R, which is required to render the markdown file.
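
To make the pieces concrete, here is a hypothetical sketch of what these two scripts can look like. The package names come from the dependencies discussed above; the file paths and render arguments are my own assumptions, not necessarily the repository’s exact code:

```
# --- R/install_packages.R (illustrative sketch) ---
# Install every package that report.Rmd depends on
install.packages(c("data.table", "ggplot2", "rmarkdown"))

# --- R/script2run.R (illustrative sketch) ---
# Render the markdown to HTML and write the report to /output
rmarkdown::render(
  input       = "/markdown/report.Rmd",  # assumed container path
  output_file = "report.html",
  output_dir  = "/output"
)
```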

The output folder is where we want to store the results of the analysis, in this case the HTML report produced by report.Rmd.

README.md is a markdown document that describes how to execute the docker-r-myproject project.

The most important file for this post is the Dockerfile. The Dockerfile is a set of instructions, written in a language the computer understands, for building the R environment your program needs to run successfully. These instructions are built into an image, and a running instance of that image is known as a container, which can be started on any machine with just one command. That makes it easy for anyone to reproduce your work. Sounds incredible, right? Well, read on and let’s see…
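
To give you a feel for it before the deep dive in the next post, below is an illustrative sketch of what a Dockerfile for a project like this can look like. It assumes the rocker/verse:latest base image mentioned later in this post and the folder layout described above; it is not necessarily a line-for-line copy of the repository’s Dockerfile:

```
# Illustrative sketch of a Dockerfile for this kind of project
# Start from a base image that ships with R and publishing tools
FROM rocker/verse:latest

# Copy the project folders into the image (paths are assumptions)
COPY R /R
COPY markdown /markdown
COPY data /data

# Install the package dependencies inside the image
RUN Rscript /R/install_packages.R

# Render the report when a container is started from the image
CMD ["Rscript", "/R/script2run.R"]
```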

Build Docker Image

We will run all Docker commands in the Windows command line. The first step is to build the Docker image.

NB: when building the image, always remember to add the period ‘ . ‘ at the end. Make sure to change the file paths to match those on your computer.

# Go to the location of the application folder
cd <path to application folder>
# example
cd C:\Users\samzeecodes\Documents\GitHub\docker-r-myproject
# Build the Docker image and give it a name
docker build -t <Docker_Image_Name> .
# example
docker build -t docker-r-myproject .

Run Image Locally

The key thing to note here is how a folder on your computer is mapped to the container file system using -v and ‘ : ‘. Note that the contents of your computer’s file system are not stored in the image; the folder is mounted when the container runs. I will speak more about this in an upcoming blog post.

# Run docker image
docker run -it --rm -v <local output folder path>:/output <Docker_Image_Name>
#example
docker run -it --rm -v /c/Users/samzeecodes/Documents/GitHub/docker-r-myproject/output:/output docker-r-myproject
# --rm automatically removes container when it stops

report.html should now be present in your output folder.

To cancel and stop the container, use Ctrl+C. To verify that the container is no longer running, run docker ps; the list should be empty. If it is not, you can manually stop containers with the command docker stop <container ID>.
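
In command form, that check looks like this:

# List running containers; the list should be empty
docker ps
# Manually stop a container that is still running
docker stop <container ID>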

Push Image to Docker Hub

Docker Hub is a public repository where developers can save their images. Saving your image to Docker Hub allows others to re-use it with ease. For example, in our Dockerfile we use the base image rocker/verse:latest and build on it. In this section, we will save the image on Docker Hub and add a tag. Tags are useful because they enable you to specify the image version. Make sure to create a repository on Docker Hub for the image.

# Log in to Docker Hub
docker login
# Re-tag the local image as <docker hub username>/<repository name>:<tag>
docker tag <local image name> <docker hub username>/<repository name>:<tag>
# example
docker tag docker-r-myproject szambezi/myfirstproject:latest
# Push the image to Docker Hub
docker push <docker hub username>/<repository name>:<tag>
# example
docker push szambezi/myfirstproject:latest

You should be able to view the Docker image under Tags in your Docker Hub repository, like below:

Figure two: Docker Hub image save example

A Video Clip That Puts It All Together

If you have not been able to follow the written tutorial so far, do not worry: there is a video, which you can watch below.

Setup Automated Build for the Image

In this section, we will use a Docker Hub feature called automated builds and link our GitHub account and repository to Docker Hub. This means that any change you make to the image locally and push to GitHub will trigger an automatic rebuild. The best way to demonstrate this is with a video tutorial, so I have attached one below.

Video: Setup Automated Build for the Image

You can verify the auto-build by making a change locally, e.g. adding a new package to the install_packages.R script, pushing the change to GitHub, and then going to Docker Hub > Builds. You should see that the image has been successfully scheduled for a build.
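
For example, after adding a package to install_packages.R, the push that triggers the rebuild looks something like this (the branch name is an assumption; use whichever branch your automated build is linked to):

# Stage and commit the change to the dependency script
git add R/install_packages.R
git commit -m "Add a new package dependency"
# Push to GitHub; Docker Hub then schedules an automatic rebuild
git push origin master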

Next

In the next blog post in this series, I will explain the Docker commands in the Dockerfile in more detail. The goal of the series is to develop an end-to-end machine learning pipeline using Docker.
