This post originally appeared on the Mango Solutions blog
It’s been a few months now since I presented on the ways in which Mango are using Docker at EARL 2014, and a lot’s happened since then, so I thought I’d take this opportunity to post a brief overview of the current state of Docker in the R and data science communities.
For those who don’t know about Docker, it’s a container for your applications: it provides an isolated, portable and repeatable wrapper around an application. You build your application once, inside the container along with all of its dependencies, and you can then take that same container and run it anywhere that Docker runs. Docker is a native Linux application, but the Docker team ship a tool called ‘boot2docker’ in Windows and Mac OS X flavours.
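The build-once-run-anywhere idea is easiest to see in a Dockerfile. The sketch below is illustrative, not from the original post — the base image tag, package choice and script name are all assumptions:

```dockerfile
# Illustrative sketch: bake an R script and its dependencies into one image.
# The base image tag and package below are assumptions for this example.
FROM r-base:3.1.2

# Install the R packages the application depends on
RUN Rscript -e "install.packages('jsonlite', repos = 'http://cran.r-project.org')"

# Copy the application into the image
COPY analysis.R /app/analysis.R

# The container now runs unchanged on any host where Docker is installed
CMD ["Rscript", "/app/analysis.R"]
```

Built once with `docker build`, the resulting image can be shipped and run as-is on any Linux host — or, via boot2docker, on Windows and Mac OS X.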
At EARL I talked primarily about large, industrial-scale analytics uses for Docker: deploying multiple stateless analytics processing nodes as a back-end service for an application. Whilst these types of uses continue to gain ground over traditional server-based deployments, another use for re-usable containers has made massive strides in the meantime.
At around the same time as EARL 2014, a project to provide R on Docker was taking shape on the other side of the Atlantic. On the 25th of September 2014, Dirk Eddelbuettel gave a talk on using R and Docker at a meetup of the Docker Chicago group and on the 2nd of October, Carl Boettiger published his paper on using Docker for reproducible research. The two had been collaborating on using R with Docker for some time and the fruits of that collaboration finally coalesced into a project called ‘Rocker’, which quickly became the official R image for Docker.
Although I don’t think the Docker team were thinking about things like reproducible research when they were developing the software, the impact in this area is potentially huge. It’s often extremely difficult to reproduce pieces of research, as the researchers’ focus is rightly on the task at hand rather than on the tools they’re using and the environment in which they’re working. Differences in software, package or library versions can affect the outcome of a computation, making it hard for others to reproduce the original work.
Enter Rocker. The Rocker team maintain several different Docker images, including one with a full install of RStudio Server and the ‘Hadleyverse’ packages (Hadley Wickham’s packages, such as ggplot2, plyr, dplyr and reshape2). Having RStudio Server baked into the Docker image provides a simple way for an analyst or data scientist to use the web-based IDE on their local machine. As Docker images are also versioned, an analyst can share their research with others without having to worry about R and package versions and so on. As long as the version of the Rocker image is also shared, anyone wishing to reproduce the research is free to use that very same image to do so.
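In practice, a researcher can pin their whole environment by building on a specific Rocker image version. A minimal sketch, assuming a versioned tag exists for the image (check Docker Hub for the tags Rocker actually publishes):

```dockerfile
# Pin the research environment to an exact Rocker image version so that
# anyone reproducing the work starts from an identical R / RStudio /
# package set. The tag below is illustrative, not a guaranteed real tag.
FROM rocker/hadleyverse:3.1.2

# Layer the project's own scripts and data on top of the fixed environment
COPY . /home/rstudio/project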
Docker is evolving quickly and uptake continues to gain speed. Previously, Docker users were primarily interested in large-scale data centre deployments, where its portability and repeatability are a major boon. Now, users with more modest computational requirements are also taking advantage of these features on their desktops and laptops. What’s perhaps most exciting here, though, is the increasing ease with which Docker, and the tools built upon it, can have a foot in each camp. The skills that allow an analyst to run the RStudio Server Rocker image on a local machine with 4 CPU cores and 8 GB of RAM are more or less the same as those that allow the same analyst to run exactly the same image on a cloud server with 32 cores and a quarter of a terabyte of RAM.
As for Mango, we have been working with Docker almost since its inception, initially performing a series of technical spikes evaluating Docker as a solution for rolling pre-validated images onto cloud environments. Since then, Mango have successfully used Docker for a variety of projects, from providing consistent and easily reproducible data science environments to building large-scale analytics pipelines. We’re even starting to use Docker with test automation software in order to test our applications on mixtures of technical platforms in parallel.
We believe that Docker’s flexibility makes it a great fit for a variety of tasks, including the generation of scalable analytic workflows. Whatever your needs are at the moment, if those needs are likely to change over time, it’s definitely worth taking a closer look at Docker.