How I Developed A Machine Learning Model From Scratch

By Kimberly Cook | Dec 26, 2018

In this article, we will study in depth the process of developing a machine learning model. Many concepts will be explained here; others that are more specific are reserved for future articles.

Concretely, this article discusses how to:
  • Define our problem adequately (objective, desired outputs...).
  • Gather data.
  • Choose a measure of success.
  • Set an evaluation protocol, and which protocols are available.
  • Prepare the data (dealing with missing values, with categorical values...).
  • Split the data correctly.
  • Differentiate between overfitting and underfitting: what these issues are and how to avoid them.
  • Understand, in overview, how a model learns.
  • Recognize what regularization is and when it is appropriate to use it.
  • Develop a benchmark model.
  • Choose an adequate model and tune it to get the best performance possible.

Universal Workflow for Addressing Machine Learning Problems
1. Define the Problem Appropriately
The first, and one of the most critical, things to do is to find out what the inputs and the expected outputs are.

The following questions should be answered:
  • What is the main objective? What are we trying to predict?
  • What are the target features?
  • What is the input data? Is it available?
  • What kind of problem are we facing? Binary classification? Clustering?
  • What is the expected improvement?
  • What is the current status of the target feature?
  • How is the target feature going to be measured?
Not every problem can be solved. Until we have a working model, we can only make certain hypotheses:

  • Our outputs can be predicted given the inputs.
  • Our available data is sufficiently informative to learn the relationship between the inputs and the outputs.

It is crucial to keep in mind that machine learning can only be used to memorize patterns that are present in the training data, so we can only recognize what we have seen before. When using machine learning we are assuming that the future will behave like the past, and this isn't always true.

2. Collect Data
Collecting data is the first real step towards developing a machine learning model. It is a critical step that cascades into how good the model will be: the more and better data we get, the better our model will perform.
There are several techniques to collect data, such as web scraping, but they are outside the scope of this article.

Typically our data will have the following shape:

Note: The previous table corresponds to the famous Boston housing dataset, a classic dataset frequently used to develop simple machine learning models. Each row represents a different Boston neighborhood and each column indicates some characteristic of the neighborhood (crime rate, average age, etc.). The last column represents the median house price of the neighborhood and is the target: the value that will be predicted from the other columns.

3. Choose a Measure of Success
Peter Drucker, management consultant and author of The Effective Executive and Managing Oneself, is widely credited with a famous saying:
"If you can't measure it you can't improve it".
If you want to control something, it should be observable, and in order to achieve success it is essential to define what is considered success: precision? Accuracy? Customer-retention rate?
This measure should be directly aligned with the higher-level goals of the business at hand. It is also directly related to the kind of problem we are facing:
  • Regression problems use evaluation metrics such as mean squared error (MSE).
  • Classification problems use evaluation metrics such as precision, accuracy and recall.
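
To make these metrics concrete, here is a minimal sketch that computes them by hand in plain Python. The function names and the sample values are made up for illustration; in practice a library such as scikit-learn offers equivalent functions.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Regression example (illustrative values only).
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])

# Binary classification example (illustrative values only).
accuracy, precision, recall = classification_metrics(
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
)
print(mse, accuracy, precision, recall)
```

Precision asks "of the samples predicted positive, how many really were?", while recall asks "of the truly positive samples, how many did we find?". Which one matters more depends on the business goal.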

In upcoming articles we'll explore these metrics in depth, see which are the most adequate for the problem faced, and learn how to set them up.

4. Setting an Evaluation Protocol
Once the goal is clear, we should decide how progress towards it is going to be measured. The most common evaluation protocols are:

4.1 Maintaining a holdout validation set
This method consists of setting apart some portion of the data as the test set, and a further portion as the validation set.
The process is to train the model on the remaining fraction of the data, tune its hyperparameters with the validation set and finally evaluate its performance on the test set.
The reason to split the data into three parts is to avoid information leaks. The main drawback of this method is that if little data is available, the validation and test sets will contain so few samples that the tuning and evaluation of the model will not be effective.
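
A three-way holdout split can be sketched in a few lines of plain Python. The function name, the 60/20/20 proportions and the seed below are assumptions chosen for illustration, not values prescribed by the method.

```python
import random


def holdout_split(samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and cut it into train / validation / test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


# 100 hypothetical sample ids stand in for real data rows.
train, validation, test = holdout_split(range(100))
print(len(train), len(validation), len(test))
```

Every sample lands in exactly one of the three sets, so nothing seen during training or tuning leaks into the final test.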

4.2 K-Fold Validation
K-Fold consists of splitting the data into K partitions of equal size. For each partition i, the model is trained on the remaining K-1 partitions and evaluated on partition i.
The final score is the average of the K scores obtained. This technique is especially helpful when the performance of the model varies significantly depending on the train-test split.
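
The K-Fold procedure can be sketched as follows. This is a simplified illustration in plain Python: the helper names are hypothetical, and the "model" is just a mean predictor so that the example stays self-contained.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on K-1 folds, score on the held-out fold, and average the K scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def mse_score(model, xs, ys):
    """Mean squared error of the constant prediction on the held-out fold."""
    return sum((y - model) ** 2 for y in ys) / len(ys)


folds = list(k_fold_indices(10, 3))
avg_mse = cross_validate(list(range(10)), [2.0] * 10, 3, fit_mean, mse_score)
print([len(test_idx) for _, test_idx in folds], avg_mse)
```

Any pair of `fit` and `score` functions with the same shape could be plugged in; the splitting logic does not depend on the model.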

4.3 Iterated K-Fold Validation with Shuffling
This technique is especially relevant when little data is available and the model needs to be evaluated as precisely as possible (it is a standard approach in Kaggle competitions).

It consists of applying K-Fold validation several times, shuffling the data every time before splitting it into K partitions. The final score is the average of the scores obtained at each run of K-Fold validation.

This method can be very expensive computationally, as the number of models trained and evaluated is I x K, where I is the number of iterations and K is the number of partitions.
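
The sketch below generates the splits for iterated K-Fold with shuffling, which makes the I x K cost visible. The function name and the strided way of forming folds are illustrative choices, not the only way to do it.

```python
import random


def iterated_k_fold(n_samples, k, iterations, seed=0):
    """Yield (iteration, train_indices, test_indices), reshuffling before every K-Fold run."""
    rng = random.Random(seed)
    for iteration in range(iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random order for this run
        for fold in range(k):
            test_idx = indices[fold::k]  # every k-th shuffled index forms one fold
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield iteration, train_idx, test_idx


splits = list(iterated_k_fold(n_samples=12, k=3, iterations=4))
print(len(splits))  # one model would be trained per split: I x K = 4 x 3
```

Averaging the scores over all of these splits gives the final estimate.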

Note: It is crucial to keep the following in mind when choosing an evaluation protocol:

  • In classification problems, both training and testing data should be representative of the whole dataset, so we should shuffle our data before splitting it to make sure that each set covers the whole spectrum of the dataset.
  • When trying to predict the future given the past (weather prediction, stock price prediction...), data should not be shuffled, as the order of the data is a crucial feature and shuffling would create a temporal leak.
  • We should always check for duplicates in our data and remove them. Otherwise, the redundant data may appear in both the training and testing sets, so the model is evaluated on samples it has already seen and the measured performance is misleading.
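
Two of the points above, removing duplicates and splitting time-ordered data without shuffling, can be sketched like this. The records are hypothetical (day, value) pairs invented for the example.

```python
def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def chronological_split(rows, test_fraction=0.25):
    """Split time-ordered rows without shuffling: the most recent rows become the test set."""
    cut = len(rows) - int(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical (day, value) records, already sorted by day; day 2 appears twice.
records = [(1, 10.0), (2, 11.5), (2, 11.5), (3, 12.0), (4, 12.5),
           (5, 13.0), (6, 13.5), (7, 14.0), (8, 14.5)]
unique_records = drop_duplicates(records)
train_rows, test_rows = chronological_split(unique_records)
print(len(unique_records), train_rows[-1], test_rows[0])
```

Because the test rows all come after the training rows in time, the model is never evaluated on the past using knowledge of the future.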

5. Preparing The Data
Before beginning to train models we should transform our data into a form that can be fed into a machine learning model. The most common techniques are:

5.1 Dealing with missing data
It is quite common in real-world problems for some values to be missing from our data samples. This may be due to errors in data collection, blank spaces in surveys, measurements that are not applicable, etc.
Missing values are typically represented with "NaN" or "Null" indicators. The problem is that most algorithms can't handle missing values, so we need to take care of them before feeding data to our models. Once they are identified, there are several ways to deal with them:
  • Eliminating the samples or features with missing values (at the risk of deleting relevant information or too many samples).
  • Imputing the missing values with a pre-built estimator such as the SimpleImputer class from scikit-learn (named Imputer in older releases). We fit it on our data and then transform the data to fill in the estimates. One common approach is to set each missing value to the mean of the observed values of that feature.
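
Mean imputation itself is simple enough to sketch by hand. The column of ages below is made up, and the helper name is hypothetical; a library imputer does the same thing per feature column with more safeguards.

```python
import math


def impute_mean(column):
    """Replace NaN entries with the mean of the observed values in the column."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]


# Illustrative feature column with two missing entries.
ages = [25.0, math.nan, 35.0, 30.0, math.nan]
imputed_ages = impute_mean(ages)
print(imputed_ages)  # [25.0, 30.0, 35.0, 30.0, 30.0]
```

To avoid leaking information, the mean should be computed on the training data only and then reused to fill the validation and test data.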

5.2 Handling Categorical Data
When dealing with categorical data, we work with ordinal and nominal features. Ordinal features are categorical features that can be sorted (clothing size: S < M < L), while nominal features don't imply any order (clothing color: yellow, green, red).

The methods to deal with ordinal and nominal features are:
  • Mapping ordinal features: to make sure that the algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Frequently we will do this mapping manually. Example: L:2, M:1, S:0.
  • Encoding nominal features: the most common approach is one-hot encoding, which consists of creating a new dummy feature for each unique value in the nominal feature column. Example: if the color column has three classes (yellow, red, green) and we perform one-hot encoding, we get three new columns, one for each unique class. A yellow shirt is then encoded as yellow = 1, green = 0, red = 0. This prevents the algorithm from assuming an order among the colors, which mapping them to integers would imply. The resulting columns are mostly zeros, so they can be stored efficiently as sparse matrices.
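
Both encodings can be sketched in plain Python using the shirt example from the text. The helper names are hypothetical, and the one-hot columns are ordered alphabetically simply to make the output deterministic.

```python
# Manual ordinal mapping from the text: S < M < L.
size_order = {"S": 0, "M": 1, "L": 2}


def encode_ordinal(values, mapping):
    """Replace each ordinal label with its integer rank."""
    return [mapping[v] for v in values]


def one_hot(values):
    """Return (categories, rows): one 0/1 column per unique category, in sorted order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


sizes = encode_ordinal(["M", "L", "S"], size_order)
categories, encoded = one_hot(["yellow", "green", "red", "yellow"])
print(sizes)       # [1, 2, 0]
print(categories)  # ['green', 'red', 'yellow']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each one-hot row contains exactly one 1, so no color is treated as "larger" than another.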

5.3 Feature Scaling
This is a crucial step in the preprocessing phase, as many machine learning algorithms perform much better when the features are on the same scale. The most common techniques are:
  • Normalization: rescaling the features to the range [0,1], which is a special case of min-max scaling. To normalize our data we simply apply min-max scaling to each feature column.

  • Standardization: centering each feature column at mean 0 with standard deviation 1, so that the feature columns have the same parameters as a standard normal distribution (zero mean and unit variance). This makes it much easier for the learning algorithms to learn the weights. In addition, compared with min-max scaling, it keeps useful information about outliers and makes the algorithms less sensitive to them.
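
As a rough illustration, both scalings can be written directly from their definitions: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sample column is made up, and the sketch assumes the column is not constant (otherwise the denominators would be zero).

```python
import statistics


def min_max_scale(column):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def standardize(column):
    """Center at mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(v - mean) / std for v in column]


values = [2.0, 4.0, 6.0, 8.0]
normalized = min_max_scale(values)
standardized = standardize(values)
print(normalized)
print(standardized)
```

As with imputation, the minimum, maximum, mean and standard deviation should be computed on the training data and then reused on the validation and test data.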

5.4 Selecting Meaningful Features
As we will see later, one of the main causes of overfitting is redundancy in our data, which makes the model too complex for the given training data and unable to generalize well to unseen data.

One of the most common ways to avoid overfitting is to reduce the dimensionality of the data. This is frequently done by reducing the number of features of our dataset via Principal Component Analysis (PCA), which is an unsupervised machine learning algorithm.

PCA identifies patterns in our data based on the correlations between the features. Correlation implies that there is redundancy in our data; in other words, some part of the data can be explained by other parts of it.

This correlated data is not necessary for the model to learn appropriately, so it can be removed. It may be removed by directly eliminating certain columns (features) or by combining several of them into new ones that hold most of the information. We will dig deeper into this technique in future articles.
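
To give a feel for how correlated columns can be combined into a single new one, here is a simplified sketch that finds only the first principal component. PCA is normally computed with an eigendecomposition or SVD from a numerical library; power iteration is swapped in here only to keep the example dependency-free, and the data is made up so that the second feature is exactly twice the first.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (feature means, unit direction of largest variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting guess; assumed not orthogonal to the leading direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """Reduce each row to a single number: its coordinate along the direction."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]


# Made-up data with two perfectly correlated (redundant) features.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, direction = first_principal_component(data)
scores = project(data, means, direction)
print(direction, scores)
```

Here two redundant columns collapse into one column of scores without losing information, because all the variance lies along a single direction.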

5.5 Splitting Data into subsets
In general, we will split our data into three parts: training, validation and test sets. We train our model on the training data, evaluate it on the validation data and finally, once it is ready to use, test it one last time on the test data.

Now, it is reasonable to ask the following question: why not have only two sets, training and testing? That way the process would be much simpler: just train the model on the training data and test it on the testing data.
The answer is that developing a model involves tuning its configuration, in other words, choosing certain values for its hyperparameters (which are different from the parameters of the model, such as a network's weights).

This tuning is done with the feedback received from the validation set and is, in essence, a form of learning. Because the model ends up indirectly fitted to the validation set, its score there is optimistic, and a separate test set that played no part in tuning is needed for an honest final evaluation.

The ultimate goal is that the model can generalize well on unseen data, in other words, predict accurate results from new data, based on its internal parameters adjusted while it was trained and validated.

a) Learning Process
We can take a closer look at how the learning process is done by studying one of the simplest algorithms: Linear Regression.

In Linear regression, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between those variables that allows us to predict a continuous outcome.

An example of linear regression: given X and Y, we fit a straight line that minimizes the distance between the sample points and the fitted line, using a method such as Ordinary Least Squares or Gradient Descent to estimate the coefficients. Then we use the learned intercept and slope, which define the fitted line, to predict the outcome for new data.

The formula for the straight line is y = B0 + B1x + u, where x is the input, B1 is the slope, B0 is the y-intercept, u is the residual and y is the output at position x.

The values available for training are B0 and B1, which are the values that affect the position of the line, since the only other variables are x (the input) and y (the output); the residual is not considered. These values (B0 and B1) are the "weights" of the predicting function.

These weights and others, called biases, are the parameters that will be arranged together as matrices (W for the weights and b for the biases).

The training process involves initializing each of the training matrices with random values and attempting to predict the output of the input data using those initial values. In the beginning the error will be large, but by comparing the model's predictions with the correct outputs, the model is able to adjust the values of the weights and biases until it predicts well.

The process is repeated, one iteration (or step) at a time. In each iteration, the initial random line moves closer to the ideal and more accurate one.
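
The iterative process described above can be sketched with gradient descent on the mean squared error for the line y = B0 + B1x. The learning rate, the number of steps and the data are assumptions made for this example, and the sketch starts from zeros rather than random values so that the run is repeatable.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = b0 + b1 * x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0  # starting point (zeros instead of random values)
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every sample with the current line.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to b0 and b1.
        grad_b0 = 2 * sum(errors) / n
        grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Move each parameter a small step against its gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x, with no noise
b0, b1 = fit_line(xs, ys)
print(round(b0, 4), round(b1, 4))
```

Each pass through the loop is one iteration in the sense of the text: the line starts far from the data and moves a little closer at every step.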

b) Overfitting and Underfitting
One of the most important problems when considering the training of models is the tension between optimization and generalization.
  • Optimization is the process of adjusting a model to get the best performance possible on training data (the learning process).
  • Generalization is how well the model performs on unseen data. The goal is to obtain the best generalization ability.
At the beginning of training, those two issues are correlated: the lower the loss on training data, the lower the loss on test data. This happens while the model is still underfitted: there is still learning to be done, because the model has not yet captured all the relevant patterns in the training data.

But after a number of iterations on the training data, generalization stops improving: the validation metrics first stall and then start to degrade. The model is starting to overfit: it has learned the training data so well that it has picked up patterns that are too specific to the training data and irrelevant to new data.

There are two ways to avoid this overfitting: getting more data and regularization.

  • Getting more data is usually the best solution; a model trained on more data will naturally generalize better.
  • Regularization is used when getting more data is not possible. It is the process of limiting the quantity of information that the model can store, or adding constraints on what information it is allowed to keep. If the model can only memorize a small number of patterns, the optimization will make it focus on the most relevant ones, improving the chance of generalizing well.

Regularization is done mainly with the following techniques:
  1. Reducing the model's size: reducing the number of learnable parameters in the model, and with them its learning capacity. The goal is to find a sweet spot between too much and not enough learning capacity. Unfortunately, there is no magic formula to determine this balance; it must be found by trying different numbers of parameters and observing the resulting performance.
  2. Adding weight regularization: in general, the simpler the model the better. As long as it can learn well, a simpler model is much less likely to overfit. A common way to achieve this is to constrain the complexity of the network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is done by adding to the loss function of the network a cost associated with having large weights. The cost comes in two forms:
  • L1 regularization: the cost is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
  • L2 regularization: the cost is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights).
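
The penalty terms can be sketched as follows: the L1 penalty sums the absolute values of the weights and the L2 penalty sums their squares, each multiplied by a strength factor and added to the loss. The function names, the weights and the strength of 0.1 are illustrative assumptions.

```python
def l1_penalty(weights, strength):
    """Cost proportional to the sum of absolute weight values (L1 norm)."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Cost proportional to the sum of squared weight values (squared L2 norm)."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Loss on the data plus the cost of having large weights."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


weights = [0.5, -2.0, 0.0]
loss_l1 = regularized_loss(1.0, weights, l1=0.1)  # 1.0 + 0.1 * 2.5
loss_l2 = regularized_loss(1.0, weights, l2=0.1)  # 1.0 + 0.1 * 4.25
print(loss_l1, loss_l2)
```

Note how the large weight (-2.0) dominates the L2 penalty far more than the L1 penalty, because squaring amplifies large values.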

To decide which of them to apply to our model, it is recommended to keep the following information in mind and take into account the nature of our problem:

6. Developing a Benchmark Model
The goal in this step of the process is to develop a benchmark model that serves as a baseline against which we'll measure the performance of a better and more finely tuned algorithm.

Benchmarking requires that experiments be comparable, measurable, and reproducible. It is important to emphasize the reproducible part of the last statement. Today's data science libraries perform random splits of data, and this randomness must be consistent across all runs. Most random generators support setting a seed for this purpose. In Python, we will use the random.seed function from the random module.
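
A quick way to see what seeding buys us is to shuffle the same indices twice with the same seed. The seed value 7 and the helper name are arbitrary choices for this sketch. Keep in mind that `random.seed` only controls Python's built-in generator; NumPy and scikit-learn keep their own random state, which scikit-learn exposes through `random_state` arguments.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a random order that is fully determined by the seed."""
    rng = random.Random(seed)  # independent generator with a fixed seed
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first_run = shuffled_indices(10, seed=7)
second_run = shuffled_indices(10, seed=7)
print(first_run == second_run)  # True: same seed, same shuffle

# Seeding the module-level generator has the same effect.
random.seed(7)
a = random.random()
random.seed(7)
b = random.random()
print(a == b)  # True
```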

As found on ""

"It is often valuable to compare model improvement over a simplified baseline model such as a kNN or Naive Bayes for categorical data, or the EWMA of value in time series data. These baselines provide an understanding of the possible predictive power of a dataset.

The models often require far less time and compute power to train and predict, making them a useful cross-check as to the viability of an answer. Neither kNN nor Naive Bayes models are likely to capture complex interactions. They will, however, provide a reasonable estimate of the minimum bound of predictive capabilities of a benchmarked model.

Additionally, this exercise provides the opportunity to test the benchmarking pipeline. It is important that benchmark pipelines provide stable results for a model with understood performance characteristics. A kNN or a Naive Bayes on the raw dataset, or minimally manipulated with column centering or scaling, will often provide a weak, but adequate learner, with characteristics that are useful for the purposes of comparison. The characteristics of more complex models may be less understood and prove challenging."
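
As a rough idea of how small such a baseline can be, here is a k-nearest-neighbors classifier written in plain Python. The data points, labels and function name are invented for illustration; a real benchmark would use the project's own dataset and typically a library implementation.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_x, train_y),
        # Squared Euclidean distance is enough for ranking neighbors.
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up two-feature samples forming two well-separated groups.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["low", "low", "low", "high", "high", "high"]
test_x = [(1.1, 1.0), (5.1, 5.0)]
test_y = ["low", "high"]

predictions = [knn_predict(train_x, train_y, q) for q in test_x]
baseline_accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, baseline_accuracy)
```

Whatever score the baseline reaches on the real data becomes the number that a more sophisticated model has to beat.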

7. Developing a Better Model & Tuning its Hyperparameters
7.1 Finding a Good Model
One of the most common methods for finding a good model is cross-validation. In cross-validation we will set:
  • A number of folds in which we will split our data.
  • A scoring method (that will vary depending on the problem's nature - regression, classification...).
  • Some appropriate algorithms that we want to check.

We'll pass our dataset to our cross-validation score function and pick the model that yielded the best score. That will be the one that we will optimize, tuning its hyperparameters accordingly.

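# Imports needed by this snippet
from matplotlib import pyplot
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# X_train and y_train are assumed to hold the training split prepared earlier
# Test Options and Evaluation Metrics
num_folds = 10
seed = 7  # arbitrary value; any fixed integer makes the folds reproducible
scoring = "neg_mean_squared_error"
# Spot Check Algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

results = []
names = []
for name, model in models:
    # shuffle=True is required by recent scikit-learn versions when random_state is set
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# Compare Algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()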

7.2 Tuning the Model's Hyperparameters
A machine learning algorithm has two types of parameters. The first type is the parameters that are learned during the training phase, and the second type is the hyperparameters that we pass to the machine learning model.

Once we have identified the model that we will use, the next step is to tune its hyperparameters to obtain the best predictive power possible. The most common way to find the best combination of hyperparameters is called Grid Search Cross-Validation.

The process is the following:
  • Set the parameter grid that we will evaluate. We do this by creating a dictionary of all the hyperparameters and the corresponding sets of values that we want to test for best performance.
  • Set the number of folds, the random state and a scoring method.
  • Build a K-Fold object with the selected number of folds.
  • Build a Grid Search object with the selected model and fit it.

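# Imports needed by this snippet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Same settings as in the spot-check step; X_train and y_train are assumed to exist
num_folds = 10
seed = 7  # arbitrary fixed value for reproducible folds
scoring = "neg_mean_squared_error"
# Build a scaler
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
# Build parameter grid
c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
param_grid = dict(C=c_values, kernel=kernel_values)
# Build the model (SVR rather than SVC, since the scoring above is a regression metric)
model = SVR()
# shuffle=True is required by recent scikit-learn versions when random_state is set
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, y_train)
# Show the results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))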

This method returns the set of hyperparameters that best fits the problem at hand. Once they are determined, our model is ready to be used. We then make the appropriate predictions on the validation dataset and save the model for later use.

8. Conclusion
We have covered a lot of important concepts in this article. Although we have only provided a high-level overview of them, this is necessary to gain a good intuition on how and when to apply the methods explained.
We will explore these methods in more depth as they come up in the next articles, along with their Python implementations.
In the next article, we will begin with the first and most common type of machine learning problem: regression.

Thanks for reading and stay tuned!

The article was originally published here

Source: HOB