The Case For a Data Science Lab

This post originally appeared on the Mango Solutions blog.

TL;DR It’s basically my take on remaining innovative when data science goes corporate.

As more and more Data Science moves from individuals working alone with small data sets on their laptops to more productionised, analytically mature settings, an increasing number of restrictions are being placed on Data Scientists in the workplace.

Perhaps your organisation has standardised on a particular version of Python or R, or perhaps you’re using a limited subset of all available big data tools. This sort of standardisation can be incredibly empowering for the business. It ensures all analysts are working with a common set of tools and allows analyses to be run anywhere across the organisation. It doesn’t matter if it’s a laptop, a server, or a large-scale cluster: Data Scientists and the wider business can be safe in the knowledge that the versions of their analytic tools are the same in each environment.

While incredibly useful for the business, this can, at times, feel very restricting for the individual Data Scientist. Maybe you want to try a new package that isn’t available for your ‘official’ version of R, or you want to try a new tool or technique that hasn’t made it into your officially supported environment yet. In all of these instances a Data Science Lab or Analytic Lab environment can prove invaluable for keeping pace with the fast-moving data science world outside of your organisation.

An effective lab environment should be designed from the ground up to support innovation, both with new tools and with new techniques and approaches. It’s rare that any two labs would be the same from one organisation to the next; however, the principles behind the implementation and operation are universal. The lab should provide a sandbox of sorts, where Data Scientists can work to improve what they do currently, as well as prepare for the challenges of tomorrow. A well-implemented lab can be a source of immense value to its users, as it can be a space for continual professional development. The benefits to the business, however, can be even greater. By giving your Data Scientists the opportunity to help drive requirements for your future analytic solutions, and by basing those solutions on solid foundations derived from experiments and testing performed in the lab, the business can achieve and maintain true analytic maturity and meet new analytic challenges head-on.

In order to successfully implement a lab in your business, you must first establish the need. If your Data Scientists are using whatever tools are handy and nobody has a decent grasp on what tools are used, with what additional libraries, and at what versions, then you have bigger fish to fry right now and should come back when that’s sorted out!
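
If you’re not sure where you stand, a rough inventory is a reasonable place to start. As a minimal sketch, assuming Python is one of the tools in use, the standard library’s `importlib.metadata` can list the packages and versions installed in a single environment. The function name `installed_packages` is made up for illustration, and a real audit would also need to cover R libraries, other tools, and every machine your team works on.

```python
from importlib import metadata


def installed_packages():
    """Return {package name: version} for the current Python environment.

    Hypothetical helper for illustration only; it inspects just the
    environment it is run from.
    """
    inventory = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            inventory[name.lower()] = dist.version
    return inventory


if __name__ == "__main__":
    # Print a sorted "name==version" listing for this environment
    for name, version in sorted(installed_packages().items()):
        print(f"{name}=={version}")
```

Running something like this in each environment your team uses, and comparing the results, gives a first picture of how consistent (or not) things really are.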

If your business analytic landscape is well understood and documented, start by distilling your existing tool set into a set of core tools. As these tools constitute the day-to-day analytic workhorses of your business, they will form the backbone of the lab. In a lot of cases, this may be a particular Hadoop distribution and version, or perhaps a particular version of Python with scikit-learn and NumPy, or a combination.
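
Once a core tool set is written down, checking an environment against it is straightforward. The sketch below assumes the core set is recorded as a simple mapping of package names to pinned versions. The function name `compare_to_core` and the version numbers are purely illustrative, not a recommendation.

```python
def compare_to_core(core, installed):
    """Compare an environment's packages against a pinned core tool set.

    Both arguments are dicts of {package name: version string}.
    Hypothetical helper for illustration only.
    """
    # Core packages that are absent from the environment
    missing = sorted(name for name in core if name not in installed)
    # Core packages present at a different version: name -> (core, installed)
    mismatched = {
        name: (core[name], installed[name])
        for name in core
        if name in installed and installed[name] != core[name]
    }
    # Packages in the environment that are not part of the core set
    extra = sorted(name for name in installed if name not in core)
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


# Illustrative data only: an assumed core set and an assumed lab environment
core_tools = {"numpy": "1.26.4", "scikit-learn": "1.4.2"}
lab_environment = {"numpy": "1.26.4", "scikit-learn": "1.3.0", "polars": "0.20.0"}

report = compare_to_core(core_tools, lab_environment)
for category, items in report.items():
    print(category, items)
```

In a lab, entries under `extra` are to be expected: they are the tools under evaluation. Entries under `missing` or `mismatched` suggest the lab has drifted from the core tools it is supposed to be built on.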

The next step can be the most challenging, as it often requires moving outside of the Data Science or Advanced Analytics team and working closely with your IT department in order to provision the environments upon which the lab will be based. Naturally, if you’re lucky enough to have a suitable Data Engineer or DataOps professional on your team, then you may avoid this requirement. A lot of that is going to depend on how agile your business is and how reliant it is on strict silos.

Ideally, any environments provisioned at this stage should be capable of being rapidly re-provisioned and re-purposed as needs arise, so working with a modern infrastructure is a high priority. It’s often wise at this stage to consider some form of image management for containers or VMs, to speed deployment and ensure environments are properly managed. You need to be able to adapt the environment to the changing needs of the user base with the minimum of effort and fuss.

Once you have rapidly deployable environments at your disposal, you’re ready to start work. What form that work takes should be left largely up to your Data Science team, but broadly speaking they should be free to use and evaluate new tools or approaches. Remember, the lab is not a place where production work is done with ad hoc tools; it’s a safe space for experimentation and innovation, just like a real laboratory environment. The knowledge gained from running tests or trials in the lab, however, can and should inform the evolution of your production tools and techniques.

A final word of warning for the business: a successful lab environment can’t be achieved through lip service. The business must set aside time for Analysts or Data Scientists to develop the future analytic solutions that are increasingly becoming central to the success of the modern business.