Streamlining the Data Scientist's Workflow

By Jyoti Nigania | Email | Aug 8, 2018 | 16368 Views

Data scientists are big data wranglers. They take masses of messy data, both structured and unstructured, and use their formidable skills to organize it. They then apply their analytical power to uncover hidden solutions to business challenges. The following two points capture the main inefficiencies in the data scientist's workflow:

1. Cleaning data and understanding its nuances.
2. Sharing final results with others.
1. Cleaning data and understanding its nuances:
Cleaning data so that it yields insights about the world can be incredibly time consuming. It is harder than it seems because the processes that generate data are typically messy; we usually can't predict them, observe them directly, or fully control them.
The following practices help address these inefficiencies in a data scientist's workflow:
  • Data visualization: It is hard to spend too much time visualizing and exploring data. The human visual system is powerful at detecting patterns and identifying oddities, and visualizing data leverages that. Becoming truly fluent in visualization tools is empowering.
  • Outlier detection: Manually inspecting a small number of individual data points helps you get a feel for the typical structure of the data and the oddities present within it. Automated outlier detection then helps focus your attention on potentially concerning or enlightening anomalies.
  • Data pipelines: Your goal should be a single command that runs your end-to-end workflow, starting from the inputs you or your team are given and ending with the final outputs. Making the workflow explicitly defined and its intermediate results inspectable removes a lot of room for error, and the code for the workflow should live in version control.
  • Model inspection: Black-box model inspection (e.g., variable importance and partial dependence plots) can help you gain confidence in a model, raise red flags, and highlight input features that merit closer inspection. Many broken models are actually the result of data-cleaning errors and oddities that weren't detected earlier in the process. White-box model inspection (e.g., the coefficients of a small linear model, the structure of a small decision tree, or the input patterns that maximize an intermediate neuron in a neural network) can be helpful here as well.
  • Change detection: Detecting and alerting on changes in input data distributions and/or model outputs can provide an early warning of input data issues that need to be cleaned up.
At Kaggle, we're building Kaggle Datasets to provide access to very high-quality data, develop communities around it, and thereby improve it. This will help reduce some of the pain of cleaning data by enabling users to document datasets, from high-level descriptions down to what individual columns mean, to see other users' code and questions about the data, and to collaborate on it. Our goal is to minimize the number of data scientists who need to feel the pain of cleaning an individual dataset, and to maximize the number who can benefit from the cleaned datasets.

2. Sharing final results with others:
Sharing final findings with others is quite difficult, yet it is essential if others are to evaluate and build on your work. There is a silver bullet here: reproducibility. Most data science workflows aren't designed to be reproducible, but there is no good reason that has to be the case.
Older tools get you much of the way there. Make facilitates chaining workflow steps into pipelines that can be run end-to-end. Pip requirements files make it easy to use the same Python libraries. Git facilitates sharing and collaborating on code. Newer technologies like Docker containers are a powerful and currently underutilized way to get even closer to full reproducibility.
The sad status quo is that most analytics work is done on a local machine, with only the outputs (such as graphs, intermediate data, and trained models) shared. It doesn't need to be this way: small steps such as setting up your environment with an end-to-end workflow and reproducibility in mind, committing to version control, and linking back to the source data and code when you share results make collaboration far easier.

At Kaggle, we're building Kaggle Kernels, a software product targeted directly at reproducibility. It combines versioned code, versioned data, and a versioned computational environment (through Docker containers) to create reproducible results. Right now, it's well suited for publicly sharing analytic insights and visualizations that fit in a single code file or Jupyter notebook and run on a single machine. Down the road, we're aiming to support the vast majority of data science use cases and get as close as possible to complete reproducibility.

Source: HOB