I've been dying to test drive one of the many recent tools on the market targeted at the citizen data scientists like DataRobot, H20 Driverless AI and Microsoft's new product in the cloud called Microsoft Azure Machine Learning Studio (Studio). These tools promise to accelerate the time to value of data science projects by simplifying machine learning model construction. Ultimately, this will allow data engineers, programmers, business analysts and others without PhDs to start chipping away at the massive modeling opportunity companies are eager to tap into but are limited in their ability to address due to data science skills shortage.
So I opened up an Azure account and spent some hours building a few machine learning models from the ground up using their sample data. I'll describe my experiences here to show you how easy this tool really is to use in the hopes that others can quickly grasp its strengths and weakness. I think I am a representative candidate to be conducting this review as I am not a working data scientist today. I, however, am a graduate student in data science at UC Berkeley, have a CS degree, have taken several graduate level statistics and machine learning courses and can program in Python.
Let me start out by saying, I really, really... like Microsoft Azure Machine Learning Studio. It makes the process of doing data science work, that is building, testing and deploying a predictive model for your data much easier and visual for both those getting started and for more experienced data science users. Studio clarifies the whole process by visually walking you through thinking about your data sources, connecting the data to potential model algorithm candidates, doing data cleansing and transformations, choosing features, training the model, testing it, selecting the best model and even deploying your new shiny working machine learning model as a web service in Azure for others to use. In the end, you are left with both a working model accessible via APIs and a visual, documented representation of your model for others to see and for you to continue to tune. Wow!
As you will also see, Studio is so easy to use, it makes data science seem almost deceptively simple. The studio does make the process simpler for sure, but just like Reese's Peanut Butter cups take two great tastes to make America's best-selling candy (according to data by Nielsen), you need both a simpler process of model construction (the chocolate) and the peanut butter inside. And that peanut butter is a healthy dose of statistical knowledge for feature determination, model selection, and interpretation and even some programming skill (for more sophisticated data adaptations).
My First Experiment
To take the product for a test drive, I just jumped right in and followed some well thought through recipes Microsoft has for creating models and didn't bother reading documentation until I needed to. You can create a machine learning modeling experiment from scratch, or you can use an existing sample experiment as a template form the Azure AI Gallery. For more information, see: Copy example experiments to create new machine learning experiments. We will walk through the process of creating an experiment from scratch.
My first model experiment was very simple and used data from one of the 39 data sets supplied from UC Irvine, Amazon, IMDB, etc‚?¶ It is a linear regression model to predict car prices based upon different variables such as make and technical specifications.
You enter Studio in the interactive workspace. To develop a predictive analysis model, you will use data from one or more sources, transform and analyze that data through various data manipulation and statistical functions, and generate a set of results. Developing a model with Studio is an iterative process. As you modify the various functions and parameters, your results converge until you are satisfied that you have a trained, effective model by evaluating its score results.
Azure Machine Learning Studio is beautifully interactive and visual. You drag-and-drop datasets and analysis modules onto an interactive canvas, connecting them together to form an experiment, which you run in Machine Learning Studio. To iterate on your model design, you edit the experiment, save a copy if desired, and run it again. When you're ready, you can convert your training experiment to a predictive experiment, and then publish it as an Azure web service API so that your model can be accessed by others.
To get started, I first went to Azure Machine Learning Studio at https://studio.azureml.net/ where I was asked to sign in using a Microsoft account, work or school account. Once signed in, you get to a home page which looks like this.
The basic layout is represented in the following tabs on the left:
PROJECTS‚??-‚??Collections of experiments, datasets, notebooks, and other resources representing a single project
EXPERIMENTS‚??-‚??Experiments that you have created or saved
WEB SERVICES‚??-‚??Web services models that you have deployed from your experiments
NOTEBOOKS‚??-‚??Jupyter notebooks that you have created
DATASETS‚??-‚??Datasets that you have uploaded into Studio
TRAINED MODELS‚??-‚??Models that you have trained in experiments and saved
SETTINGS‚??-‚??A collection of settings that you can use to configure your account and resources.
At the top level, the recommended workflow for conducting an experiment and ultimately publishing it as a web service is as follows:
1. Create a model
Get the data
Prepare the data
2. Train the model
Choose and apply an algorithm
3. Score and test the model
Predict new automobile gas prices
4. Publish the model as a cloud service
Get the Data
Create a new experiment by clicking +NEW at the bottom of the Machine Learning Studio window. Select EXPERIMENT > Blank Experiment and I named the experiment Automobile Price Prediction. There are many other pre-built experiments you can choose from but I chose this as a first look.
To the left of the experiment, the canvas is a palette of sample datasets and modules which you can search. I chose the dataset labeled Automobile price data (Raw) and then dragged this dataset to the experiment canvas. Of course, Studio supports uploading your own dataset in many formats too.
One really nice feature that data scientists appreciate is the ability to get a quick look at the data columns and distribution to understand the data we are dealing with. To see what this data looks like, you can simply click the output port at the bottom of the dataset, then select Visualize.
Datasets and modules have input and output ports represented by small circles‚??-‚??input ports at the top, output ports at the bottom. To create a flow of data through your experiment, you'll connect an output port of one module to an input port of another. At any time, you can click the output port of a dataset or module to see what the data looks like at that point in the data flow.
In this dataset, each row represents an automobile, and the variables associated with each automobile appear as columns. We'll predict the price in far-right column (column 26, titled "price") using the variables for a specific automobile. Note the histograms of each column that are given and the details on the distribution of the data in the right pane. This quick look seemed more time-consuming to find in other tools I have used.
Prepare the data
As any experienced data scientists know, datasets usually require some preprocessing before they can be analyzed. In this case, there are missing values present in the columns of various rows. These missing values need to be cleaned so the model can analyze the data correctly. We'll remove any rows that have missing values. Also, the normalized-losses column has a large proportion of missing values, so we'll exclude that column from the model altogether.
The studio makes this process very easy. They supply a module that removes the normalized-losses column completely (Select Columns in Dataset) and then we'll add another module that removes any row that has missing data.
First, we type in "select columns" in the search bar to the left and drag onto the canvas the Select Columns in Dataset module. Then, we connect the output port of the Automobile price data (Raw) dataset to the input port of the Select Columns in Dataset module by simply clicking and drawing a line between the two dots.
By clicking on the Select Columns in Dataset module we Launch the column selector in the Properties pane of this module. By using the WITH RULES and begin with ALL COLUMNS settings, with a few steps, we can exclude a column name, in this case, the normalized-losses column and the module will still pass through all other columns. Studio lets you double-click to add a comment to a module by entering text so you see at a glance what the module is doing in your experiment. In this case, I added the comment "Exclude Normalized Losses."
Similarly, to remove rows with missing data, drag the Clean Missing Data module to the experiment canvas and connect it to the Select Columns in Dataset module. In the Properties pane, select Remove entire row under Cleaning mode. These options direct Clean Missing Data to clean the data by removing rows that have any missing values. I then double-click the module and type the comment "Remove Missing Value Rows."
Defining features simply means that we will select the columns (features) that we will use in the model to predict the price. Defining features requires experimentation as some features will have more predictive power than others. Some features will be highly correlated with other features and therefore not add to the predictive power of the model and these features should not be included to make the model as parsimonious as possible. A parsimonious model is a model that accomplishes the desired level of explanation or prediction with as few predictor variables as possible.
For our walk-through, we'll keep to Microsoft's example and assume a subset of features that may allow us to predict price:
To add these features we drag the Select Columns in Dataset module to the canvas and connect the output of the Clean Missing Data column to its input. We double click the module and type Select Features for Prediction as our descriptor. Next, click Launch column selector in the Properties pane and select with rules. We can begin with No Columns and one by one we add the column names (features) to the model's list. Click the check mark (OK) button when done. This module produces the filtered dataset of only those features (and associated data) that we want to pass to the learning algorithm that we will add next.
This sure sounds a lot simpler than the reality and complexity of optimal feature selection when you start to dig into the documentation and the data science behind it all. And in a first walk-through, we do want it to be simple so we can experience the overall flow of building a model. But let me just give you some insights so readers won't walk away assuming this is really easy and that they should roll out Studio to every analyst in the company right away.
As you work with this product more, you'll find that there are modules that you should use to select features and that this should be a step in your process flow after cleaning the data. The studio provides these modules for feature selection:
Filter-Based Feature Selection: Identifies the features in a dataset that have the greatest predictive power.
Fisher Linear Discriminant Analysis: Identifies the linear combination of feature variables that can best group data into separate classes.
Permutation Feature Importance: Computes the permutation feature importance scores of feature variables for a trained model and test dataset.
Microsoft includes this article about the feature selection modules and how to use them if you want to see more.
As a real-life story of how feature selection complexity can play out in modeling, here is an example of feature selection done by two enterprise architects at Microsoft's Data Insight's Center of Excellence. It describes the journey they went through to migrate their Excel linear regression model for financial forecasts to Studio. They ultimately succeeded in getting better forecasts from Studio and were able to publish the model as a web service making it more accessible‚??-‚??three cheers for Studio!
But, keep in mind they had some learning and tweaking to do by understanding whether to use "Online Gradient Descent" or "Ordinary Least Squares" methods of regression and also discovered that they needed to tweak the L2 Regularization Weight depending on their data set size. They also described how they began using theFilter-Based Feature Selection to improve on their selection of initial data elements and that they also intended to test additional algorithms like Bayesian or Boosted Decision Trees to compare performance with linear regression which we will discuss in the next section. It takes time, testing and data science training to really produce the best results‚??-‚??even with a simple tool like Studio.
One area of improvement I would like to see is more help from Studio in this area of automatically selecting the optimal features. The terminology above does not sound like what citizen data scientists would understand. I did discover that some of the machine learning algorithms (discussed in the next section) in Studio do use feature selection or dimensionality reduction as part of the training process. When you use these algorithms, you can skip the feature selection process and let the algorithm decide the best inputs which are a step in the right direction. But knowing which algorithms do this and which don't as well as actually performing feature selection will be a difficult, time-consuming process for citizen data scientists who may not understand how to best select features.
Training the model‚??-‚??Chose and Apply an Algorithm
Now we are ready to train the model. In this example, we are doing what is called supervised machine learning and there are many algorithms available that may offer the predictive power we seek. For example, there are classification models that will predict which category a subject row is likely to be in (is it a car or a truck?), there are regression algorithms that predict a numeric answer (like future stock price). There are 25 types of models baked into MLS for anomaly detection, classification, regression and clustering and many more that appear to be open library modules also available.
Because we want to predict price, which is a number, we'll use a regression algorithm‚??-‚??linear regression in this case.
We train a model by giving it a set of data that includes the answer we want to predict (the price.) The model scans the data and looks for correlations between an automobile's features and its price and captures them in the model (a mathematical equation).
The studio will allow us to use our data for both pieces of training the model and testing it by splitting the data into separate training and testing datasets. This is done by dragging the Split Data Module to the experiment canvas and connecting it to the output of the Select Columns in Dataset module.
Click the Split Data module to select it and spilled the data into two parts using 75% of the data for training and the remaining 25% for scoring as shown.
At this point, we should run the experiment so that the Select Columns in Dataset and Split Data modules have the data and latest column definitions to pass forward when we score the model. This is done by pushing the run button at the bottom of the screen.
Now it's time to add the Learning algorithm we wish to use by expanding the Machine Learning Category on the left side of the canvas and expanding Initialize Model. Select the linear regression module and drag it to the canvas. Also, find and drag the Train Model module to the experiment canvas. Connect the output of the Linear Regression module to the input of the Train Model and connect the training data (left port) of the Split Data module to the Train Model as shown.
Click the Train Model module, and then click Launch column selector in the Properties pane and then select the price column. Price is the value that our model is going to predict. Move price from available columns to selected columns list.
At last, we can Run the experiment. We now have a trained regression model that can make price predictions.
Predict new automobile prices
Now that we've trained the model, we can use it to score the other 25 percent of the data to see how well our model functions. We do this by dragging the Score Model module to the experiment canvas and connecting the output of the Train Model to it. We then connect the test data output (right port) of the Split Data module to the Score Model module as shown.
Now run the experiment and view the output for Score Model by clicking the bottom port and selecting Visualize. The predicted prices are shown in the column Scored Labels along with all of the known feature data used by the model. The column price is the actual known price from the data set.
Deciding if our model is a good model?
The last step before we publish any model for use is to test the quality of the results. To do this we drag the Evaluate Model module onto the canvas and simply connect it to the output of the Score Model. Now when you run this experiment again, you will be able to visualize statistical results.
This area is where I think this service still could be better. To select the model that is best, a user will have to iteratively run many different experiments, save the results and compare them until you get the best fit model. That requires statistical knowledge and an understanding of when to use linear regression vs. classification vs. logistical regression or some of the many open source algorithms which might be great for modeling the data. Further, if the user is really a citizen data scientist, do they really understand how to interpret these results and use them in context to decide on the best model?
In this case, for each of the reported results, being smaller is better as that indicates predictions that more closely match the data (less error). The exception is the Coefficient of Determination (also called R squared) which we seek to make as close to 1.0 as an indication of model accuracy. This model is a .91 predictive accuracy in fitting the data to the line. I'd love to tell you that is pretty good, but really the answer is it depends on your data and what you are trying to predict and the consequences of being wrong. To get an idea of how complicated this is you can read the following from Duke https://people.duke.edu/~rnau/rsquared.htm. Citizen data scientists will need more help from Microsoft in this area.
Publish the model as a Cloud Service
One really nice feature is how easy Studio makes it get your model into production. For those of us who are not data engineers or IT and skilled at publishing a model in the cloud as an API for others to use, this makes getting your work out much easier.
The simplest way to publish a model is to use the Setup Web Service button and simply publish the model as a web service classic. This option converts your model from an experiment to a predictive experiment by eliminating data splits, training and other unnecessary steps in a model after you have decided on its features and algorithms. You run the model one last time to check the results and you are ready to go with an API key for others to use on Azure.
Scalability & Performance
Deploying does raise the issues of what am I deploying to‚??-‚??What are the hardware resources I can use, what is my service level guarantee, is it a dedicated or multi-tenant cloud, security, etc..?
Azure Machine Learning Service is multi-tenant and computes resources used on the back end will vary and are optimized for performance and predictability. The studio can be deployed on free tier with data sets no larger than 10 GB or on the standard paid tier which allows the use of many more paid resources and BYO storage.
One of the hot issues in data science is using GPUs to achieve very fast computing performance. I do see the Azure offers GPUs for compute-intensive applications, but I don't see a way to directly specify GPUs for Studio as it is a multi-tenant service. Perhaps the way to guarantee the compute performance needed is through the paid SLA that is part of the standard paid tier or Microsoft may have other methods of guaranteeing GPU access that isn't obvious in what I have read.
Microsoft has done an outstanding job of building a cloud service that clarifies, simplifies and ensures the integrity of the process of building machine learning models. Their process lays out visually a simple clear method of acquiring data, offers tools for cleansing data and choices of models. The studio goes on to require training and model scoring and will not let you proceed unless prior steps are correctly executed. Ultimately, Studio even makes deployment of models as a web service easy.
But as I said in the title, Studio is not a panacea that allows anyone to build machine learning models. Machine learning is complex and data science knowledge is still required. A user would need statistical knowledge to understand which algorithms to choose, how to choose features and to interpret the scoring results as to which model is the best fit for your circumstances. Also, to allow more flexibility modules may be inserted into the flow of cleansing and transforming data that can be custom built in R or Python, etc.. This requires programming skills.
To take this product the next step for citizen data scientists Microsoft must offer more data science intelligence built in. Imagine if Studio could look at your data and what you want to predict and run through a series of algorithms, try different features, score the models and then offer up the model with the best-recommended fit! That would save the user a lot of time, effort, cost and require less heavy data science knowledge of the user.
I think Studio will work well for many users seeking to build models including citizen data scientists seeking a drag-and-drop solution with pre-built algorithms to more advanced enterprise users who can incorporate Studio work into an even broader ecosystem of Microsoft's Azure Machine Learning Service. This service allows data scientists to work in a Python environment, provides more control over machine learning algorithms, deployment and supports open-source machine learning frameworks like PyTorch, TensorFlow, and scikit-learn. I look forward to working with this product more.