What Makes an Efficient Data Science Workflow?
An efficient workflow should do just that -- flow -- directing us seamlessly from each phase of a project to the next, optimizing task management, and ultimately guiding us from business problem to solution to value. As the data deluge continues to rain down, businesses are drowning in data but starving for insight. This makes the hiring of a data science team a vital investment. But what makes up a data science team? What are the best practices for data science workflows? And what do data scientists need to execute their data science workflow to the best of their ability?
While there is no template for solving data science problems, the OSEMN (Obtain, Scrub, Explore, Model, Interpret) data science pipeline, a popular framework introduced by data scientists Hilary Mason and Chris Wiggins in 2010, is a good place to start. Most data science workflows are variations of the OSEMN sequence of steps, having fundamental processes based on the same established principles, and with the common goal of enabling the rest of the organization to make better, data-driven decisions. The features of data science workflows depend entirely on the business goals and task at hand.
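To make the OSEMN sequence concrete, here is a minimal sketch in plain Python. It is an illustration rather than a reference implementation: the inline ride counts, the function names, and the "mean rides per zone" baseline are all hypothetical stand-ins for whatever data source and model your project actually uses.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for a real source (file, API, database).
RAW_CSV = """zone,rides
north,120
south,95
east,
west,143
north,130
"""

def obtain(text):
    """Obtain: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Scrub: drop rows with a missing ride count and convert types."""
    return [{"zone": r["zone"], "rides": int(r["rides"])} for r in rows if r["rides"]]

def explore(rows):
    """Explore: compute simple summary statistics."""
    rides = [r["rides"] for r in rows]
    return {"count": len(rides), "mean": statistics.mean(rides)}

def model(rows):
    """Model: a deliberately simple baseline, the mean rides per zone."""
    by_zone = {}
    for r in rows:
        by_zone.setdefault(r["zone"], []).append(r["rides"])
    return {zone: statistics.mean(values) for zone, values in by_zone.items()}

def interpret(predictions):
    """Interpret: turn model output into a statement a stakeholder can act on."""
    busiest = max(predictions, key=predictions.get)
    return f"Highest expected demand: {busiest} ({predictions[busiest]:.0f} rides)"

def run_pipeline(text):
    """Run the five OSEMN steps in order and return their outputs."""
    rows = scrub(obtain(text))
    summary = explore(rows)
    predictions = model(rows)
    return summary, predictions, interpret(predictions)
```

Calling `run_pipeline(RAW_CSV)` returns the summary, the per-zone baseline, and a one-line interpretation. Keeping each phase in its own function makes it easy to swap one step out, for example replacing the baseline with a real model, without touching the others.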
The most important step in improving your data science workflow is the development of best practices for your team’s particular needs. In doing this, you’ll want to consider the following data science workflow best practices.
Data Science as a Team Sport
Early on, the data scientist was imagined as one person who could magically do everything. For obvious reasons, that expectation is unrealistic. Data science encompasses a wide variety of disciplines and roles, including software engineers, machine learning engineers, system architects, database administrators, business intelligence analysts, IT engineers, and more. A data science team should include individuals who specialize in different areas. An effective team workflow starts with determining the kind of expertise needed on your team and clearly defining the roles within it.
If you’re building a prototype, you might not need a systems architect. A database administrator might not be necessary if you're working on a smaller project. A production engineer would be best suited for customer-facing services. And some team members with experience from academia will mainly perform research that is not necessarily intended to result in a product for sale. The various roles on your data science team are determined by your business goals and tasks. The data scientist is not a one-man band, and the value of a lone generalist is often overestimated. Having all these specialists work together towards a common goal will get you farther than having a few individuals try to do everything themselves.
Identifying Your Business Questions
What question are you answering and what are the business goals? A major component of data scientists’ productivity is the ability to break big problems down into smaller pieces and to focus on the business outcome you're trying to achieve, as opposed to doing research for research’s sake. Ultimately, data science teams exist to improve a business process, increase revenue, or lower costs. The ability to ask the right questions and solve real business problems determines your success. Identifying the core question sets the agenda for what you want your team to accomplish. Who is your end user? What is their problem? What are you prioritizing -- accuracy, speed, or explainability?
Embracing Open Source and the Cloud
The cost-prohibitive aspects of early data science workflows have effectively been eliminated thanks to open source data analysis solutions and cloud computing. Open source has become the predominant source of tools for data scientists. In terms of access, you no longer need to build your own data center. If you want to use a variety of different tools, you can test them out and subscribe to them on an as-needed basis. And cloud computing provides large amounts of hardware that can be rented by the hour.
There's also generally no explicit cost for using open source libraries, which provide incredible resources and flexibility. Unlike proprietary software, an open source project can be modified to suit your needs. Building on an existing project eliminates the need to start from scratch, saving an enormous amount of time and money. Switching costs should be lower as well without any actual licensing cost. With open source in combination with cloud computing, you can evaluate what you want to use, create a prototype, test it out for a period of time, determine what doesn't work, and then try something else, all at a much lower cost.
Building the Right Data Science Workflow Toolkit
The bulk of a data scientist’s time is spent understanding the business problem and communicating the results. Documenting and communicating your findings in a clear and efficient way can be one of the most challenging steps in the scientific process. Automating this process is crucial for good data science workflows and for your sanity. Some useful data science workflow tools include:
Data Science Workflows with Jupyter
Jupyter Notebook is an open source front end for data science, built around notebooks that contain live code, equations, visualizations and explanatory text, and often used to capture the data preparation process. Jupyter Notebook works whether you're using a laptop, a server, or a cloud provider. The notebook aspect refers to the fact that you have your code and the results in the same window. As a means of communication and interactive exploration, Jupyter Notebooks have a very desirable set of properties: you can add little bits of code at a time, see the result, write corresponding notes to yourself on your data sources and conclusions, and then send those files to other people. For a notebook to work on someone else's machine, they need the data and all the dependencies used to reproduce the results, which is where Docker containers come in.
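Part of what makes notebooks easy to share is that a notebook file is just a JSON document holding a list of cells. The sketch below builds a tiny one by hand using only the standard library, assuming the version 4 notebook format; the helper name `make_notebook` and the `rides.csv` file mentioned in the cell are hypothetical, and in practice Jupyter writes this file for you.

```python
import json

def make_notebook(note, code):
    """Build a minimal notebook (format version 4) with one note and one code cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            # Explanatory text lives in a markdown cell.
            {"cell_type": "markdown", "metadata": {}, "source": note},
            # Code and its results live together in a code cell.
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,  # the cell has not been run yet
                "outputs": [],            # filled in when the cell is executed
                "source": code,
            },
        ],
    }

notebook = make_notebook(
    "Rides per zone, loaded from a hypothetical rides.csv",
    "import csv\nrows = list(csv.DictReader(open('rides.csv')))\nlen(rows)",
)

# Serializing gives the text of an .ipynb file that can be sent to a colleague.
ipynb_text = json.dumps(notebook, indent=1)
```

Because notes, code, and outputs travel in one file, the recipient sees the analysis in the order it was written. What the file does not carry is the data or the installed libraries, which is the gap containers are meant to close.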
Data Science Workflows Using Docker Containers
With Docker, you can package all your code, and everything you need to run the code, in standardized, isolated software containers that can be passed into and work in any environment.
Data Science Workflows with RAPIDS
RAPIDS is an open source suite of GPU-accelerated machine learning and data analytics libraries that run on NVIDIA GPU platforms. RAPIDS is ideal for teams that are solving larger scale problems, need millisecond response times, or execute large volumes of repeated computation.
Data Science Workflows with Amazon Web Services
Amazon Web Services offers a suite of data science tools well-suited for machine learning workflows. Orchestrate and automate sequences of machine learning tasks by enabling data collection and transformation. Use Amazon Athena to perform queries, aggregate and prepare data in AWS Glue, execute model training on Amazon SageMaker, and deploy the model to the production environment. Data science workflows can be shared between data engineers and data scientists.
Machine Learning and Networking
Machine learning and artificial intelligence, often used interchangeably for business purposes, are ideal for solving business problems that demand an accurate answer without necessarily needing an explainable answer. For example, in a ride sharing app, if you're just trying to predict how many users are going to be in a given part of the city or how many vehicles need to be there, you don't necessarily care about the Why -- you just want to get the most accurate number.
The best resources for automated machine learning and deep learning workflows are, in the spirit of open source, other data scientists. Network with them, read what they publish, study how other feature engineering projects were solved, and try to adapt and improve upon their techniques. Doing so is far more effective than relying on any one book, tool, blog post(!) or person to improve your machine learning workflows.
Efficiency -- Newer Isn’t Necessarily Better
Trying to chase the newest thing may be damaging your data science workflow efficiency.
Most data science projects won’t require a cutting-edge approach. Spending too much time worrying about what the cutting edge is, versus doing something well understood that might get you 99 percent of the results, may land you in a cycle of endless research with no clear solution. In most business cases, it’s better to get more things done than to chase the last percent of accuracy.
Reproducibility is very important, but also very hard to achieve and to demonstrate. The whole goal of reproducibility is to say: this is the data I used, this is the code I used, and if you do the same exact thing, you'll get the same exact answer. There are still significant challenges in reproducibility in the field of data science. Even if you use version control for the code you've written, an open source library you depend on could change unless you record every dependency and its version. It's also very difficult to do version control when conducting big data analytics at enormous scale. Without the infrastructure to make copies of these enormous datasets, you are left with a single copy that is vulnerable to alterations.
The safest course of action is to version all of your code with Git, record every package you used along with its version, and keep a copy of the data set if you can. At the very least, others will then be able to follow the reasoning of the original author.
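As a rough illustration of writing these things down, the sketch below records the Python version, the installed version of each package you name, and a SHA-256 fingerprint of the data. The function names and the manifest layout are hypothetical choices, not a standard, and for truly large datasets you would hash the file in chunks or rely on storage-level versioning instead.

```python
import hashlib
import platform
from importlib import metadata

def installed_versions(packages):
    """Return the installed version of each package, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def sha256_of_bytes(data):
    """Fingerprint the data so later changes to it can be detected."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(packages, dataset_bytes):
    """Collect what a reader would need to attempt the same run."""
    return {
        "python": platform.python_version(),
        "packages": installed_versions(packages),
        "dataset_sha256": sha256_of_bytes(dataset_bytes),
    }
```

Committing a manifest like this next to the code in Git gives a reader a way to check whether their environment and data match yours. If the recorded hash differs from the hash of the data they hold, the data has changed since the analysis was run.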
Python and R?
The best language for data science workflows is... it depends. R and Python are both high-level languages with their own strengths for data science projects. Packages for R and Python often have a lower layer where the computation is done in a very fast language, such as C++ or Fortran. The difference tends to lie in application. Where R is more of an academic, research-based, statistician’s language, Python is used more for scientific research, data science, building applications, and production engineering. Python may be preferable for data science workflows because it is generally considered faster and better for data manipulation, and it is inherently object-oriented. R may be more difficult to learn, but it is generally considered better for ad hoc analysis. Data science workflows in R and data science workflows in Python both have merits, and there may be some value in shifting the conversation from ‘R or Python’ to ‘R and Python’ for data science projects.
The OmniSci Advantage for Data Science Workflows
OmniSci was built on the foundation of GPU acceleration, targeting extremely high performance in its analytics platform from its inception, and Immerse was born out of that obsession. Immerse gives you the ability to look at and visualize much larger volumes of data than you could in the past, with the GPU handling not only the computation but also the graphics rendering. In terms of the scale of problems that you can work on, especially around geospatial data, OmniSci has the advantage because all of the hardware is used to its full capability, for math, for pictures -- the full spectrum.
The desire to gain insight from data shows no signs of slowing. As the demand for data scientists increases at an astounding rate, so too does the importance of supporting your data science team and developing a solid data science workflow. Data science is an art, and with a properly equipped, inspired team, any project can be transformed into a valuable and compelling story.