Machine Learning Zero-to-Hero: Everything you need in order to compete on Kaggle for the first time, step-by-step!

By Nand Kishor | Apr 2, 2018

I recently came across Rachel Thomas's article on the importance and value of writing about what you learn, and Julia Evans's advice on why and how to write, and so I have decided to follow their advice and write an article (for the first time ever!).

This article is about everything I wish I had known a year ago, when I first decided I wanted to learn more about Data Science. It is meant for anyone who is interested in getting into Data Science, either as a hobby or as a potential career. Having done a MOOC first and knowing some basic Python will help you get the most out of this article, but neither is required by any means. This article is not meant to showcase anything impressive (sorry mom, dad, and potential employers), but rather to go over the basics and help beginners start off strong.

Topics covered:
  1. Introduction.
  2. Overview of Kaggle.
  3. Setting up your own environment.
  4. Overview of the Predict House Prices competition.
  5. Loading and inspecting the data.
  6. Our model: explanation of Decision Trees, the bias-variance tradeoff, and Random Forests.
  7. Preprocessing the data.
  8. Putting it all together and submitting.


Introduction
Nowadays, there are several high-quality Machine Learning tutorials and MOOCs available online for free. A year ago, I binged Udacity's Intro to ML, which I found extremely approachable and beginner-friendly, and which did a good job of introducing me to basic ML concepts, several popular algorithms, and scikit-learn's API. After finishing the course, I was very excited to learn more, but felt a little lost.

After conducting some research, I decided the best thing to do next was to check out Kaggle, a popular Google-owned platform for predictive modelling competitions. Nothing beats hands-on learning through practice!

Competing on Kaggle for the first time is daunting and often frustrating (and achieving a decent score even more so!), and so this article will focus on how to enter your first competition and utilize Kaggle to maximize your personal growth and success.

Kaggle: Overview

If you are already intimately familiar with Kaggle, feel free to skip to the next section. Otherwise:

The two Kaggle competitions most suitable for beginners (which serve as Kaggle's version of 'tutorials') are Titanic (predicting survival: a binary classification problem) and House Prices (predicting price: a regression problem). While I highly recommend checking out and competing in both, this article will focus on the latter. However, much of the content is general and equally applicable to other Kaggle competitions and Data Science problems, so feel free to choose the Titanic or some other competition if you like!

  • On each competition's Overview tab, you can see some background information about the competition and its dataset(s), the evaluation metric by which submissions will be scored (which changes from competition to competition), and a competition-specific FAQ.
  • On the Data tab you can see a brief description of the data. We will need three files: train.csv, test.csv, and data_description.txt (which is crucial, as it contains a much more detailed description of the data). Put them in a folder you can easily access.
  • The Discussions tab is like a competition-specific forum; however, don't underestimate it! In popular ongoing competitions it often contains valuable information, as the competition terms sometimes require participants to publicly share any outside information they use on the Discussions board. For example, data leakage is hard to avoid and deal with, and it occasionally occurs in competitions. On the one hand, exploiting it well is critical for achieving the highest score and winning the competition; on the other hand, models incorporating data leakage are usually useless for any practical purpose and for the organizers of the competition, as they contain 'illegal' information. Often, a diligent participant will share the leakage publicly on the Discussions board to level the playing field and take away its competitive edge. Other than that, Kagglers often share information in a joint effort to learn and grow as a community. People who rank high on the leaderboard will sometimes share their successful approach (usually either early on or after the competition has finished).
  • The Kernels tab is basically the applied, code-based version of the Discussions board, and in my opinion the most important tab for a beginner to explore. Anyone can share any script or notebook, connected to any dataset or competition, complete with documentation, explanations, visualizations, and outputs, which everyone can then go through, vote on, copy and paste from, and even run entirely in their browser! Both of the aforementioned competitions have many interesting, beautiful, and highly successful Kernels, and I strongly recommend going through them later, after having first given it a go by yourself. Although Kernels are a relatively new feature, Kaggle is constantly improving them, and there is even a $100,000 competition going on right now in 'Kernels Only' mode. However, Kernels can often be overwhelming, lacking in conceptual explanations, or assuming prior knowledge. In other words, they often show what works, but sometimes not how they got there or why it works.

Setting up our own environment
I highly recommend using Python 3.6 and working in the Jupyter Notebook environment for anything Data Science related (the most popular distribution is called 'Anaconda', and includes Python, Jupyter Notebook, and many useful libraries). You can then start the environment at any time by typing jupyter notebook into your Terminal (or through the Anaconda GUI). Otherwise, everything shown here can be done in a private Kernel on the Kaggle website (entirely in your browser), which is essentially identical to a Jupyter Notebook.

A few essential Jupyter Notebook tips before we begin:

  • You can start typing any method name and hit 'Tab' to see a list of all possible options.
  • Similarly, selecting any method and hitting 'Shift-Tab' a few times will open up its documentation in your notebook.
  • Typing %time before any statement and executing the cell will print how long it takes to execute.
  • Similarly, typing %prun before any statement and executing the cell will run it through Python's code profiler and print the results.
  • For a comprehensive list of 'magic' commands, please refer to the documentation; a short example of %time and %prun follows below.
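A minimal notebook cell illustrating both magics (np.sort here is just an arbitrary statement to measure; any statement works):

import numpy as np

arr = np.random.rand(10**6)
%time np.sort(arr)    # prints how long this single statement takes to run
%prun np.sort(arr)    # runs the statement under the profiler and prints a per-function breakdown

Onward!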

A Step-by-Step Guide to Predicting House Prices
Overview of the objective
This is a supervised learning problem, meaning we have a training set which includes a number of observations (rows) with various pieces of information about them (columns). One of those columns is the one we are interested in being able to predict; it is often called the 'target' variable or the 'dependent' variable (and sometimes the 'label' or 'class' in classification problems). In our case, this is the Sale Price (surprise surprise). The other columns are often called 'independent' variables, or 'features'. We also have a test set, which also has a number of observations, with exactly the same columns except for the target variable, which is missing and which it is our job to predict. Therefore, ideally we want to build a model which can learn the relationship between the independent variables and the dependent variable from the training set, and then use that knowledge to predict the dependent (or target) variable for the test set as accurately as possible. Since the target variable, SalePrice, is continuous (it can take any value), this problem is called a 'regression' problem.

Load the data and take a peek
Now that we have a Jupyter Notebook up and running, the first thing we want to do is load the data into a Pandas DataFrame. Pandas is a popular and powerful library that handles everything related to data analysis in Python, and DataFrame is the name of the object it uses to store data.
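The loading cell itself is only a few lines; here is a minimal sketch of it (the PATH value mirrors the full script at the end of this article and should point to wherever you saved the Kaggle files):

import pandas as pd

PATH = "Oren/Kaggle/Housing Prices/"  # wherever you put the competition files
df_train = pd.read_csv(f'{PATH}train.csv')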

The last line loads the CSV file we downloaded from Kaggle (CSV stands for 'comma-separated values', a common format which can also be viewed directly with standard software like Excel) into a Pandas DataFrame, using Python 3.6's convenient f-string formatting. Since you will be using read_csv a lot, I recommend skimming its documentation now. In general, if you ever come across any method more than once, it's a good habit to skim through its documentation, and as noted above, you can also do so directly in your notebook! The first time I loaded and looked at the DataFrame, I noticed that the first column in the dataset, Id, represents the index of each row rather than being an actual variable. Therefore, I went back and passed index_col='Id' as an argument when loading the data into a DataFrame, to make sure Pandas uses it as the index instead of treating it as a column and adding a new index column before it. See how reading the documentation is already coming in handy?

Now, let's see what the training set looks like!
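A sketch of the inspection step, assuming the revised read_csv call with index_col='Id' described above:

df_train = pd.read_csv(f'{PATH}train.csv', index_col='Id')
print(df_train.shape)   # (rows, columns): 79 features plus SalePrice once Id is the index
df_train.head()         # in a notebook, this displays the first five rows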

As we can see, Id is now used as the index. Additionally, the training set has 80 columns overall (excluding Id), out of which 79 are independent variables and 1 is the dependent variable. Therefore, we would expect the test set to have only 79 columns (the independent variables), and that is indeed the case; go check it now!

Most of those numbers and strings don't mean much to us yet, and a sharp observer will notice that the values in the 'Alley' column are suspiciously all 'NaN' (which stands for 'Not a Number'), meaning the values are missing. Not to worry; we will deal with that eventually.

The next step is to start thinking about what kind of model we would like to use. Thus, let's take a break from the data for a few minutes and talk about Decision Trees (sometimes instead called 'Regression Trees' when applied to regression problems). I'll come back to the data afterwards, so please bear with me!

A Brief Explanation of Modelling
Intro to Decision Trees
The basic idea here is simple: when learning the training data (usually termed 'being fitted' to the training data), the Regression Tree searches over all of the independent variables, and then over all the values of each independent variable, to find the variable and value that best split the data into two similar groups (in mathematical terms, the tree always chooses the split which minimizes the weighted average variance of the two resulting nodes), and then calculates the score (based on the chosen metric) and the average value of the dependent variable for each of the groups. The tree then repeats this process recursively until there are no more splits to perform (unless max_depth was explicitly specified, as in the tree discussed below). Each node at the last level of the tree is called a 'leaf', and each leaf is associated with the average value of the dependent variable across all of the observations that ended up in that leaf's group.

As a side note, this is a great example of a 'greedy' algorithm: at every split, it checks all the options and chooses the one that seems best at that point, in the hope of eventually achieving a good overall result. After the tree has been fitted to the training data, any observation for which we wish to predict a value for the dependent variable simply traverses the tree until it reaches a leaf (an end node), and is then assigned that leaf's corresponding dependent variable value.

Let's take a closer look at such a tree (a sketch of how one can be fitted and drawn follows below): in each node, the first element is the node's split rule (the independent variable and its value), the second element is the Mean Squared Error (MSE) of all the observations in that node, and the third element is the number of observations in that node ('samples'), i.e. the size of the group. The last element, 'value', is the natural logarithm of our target/dependent variable, 'SalePrice'. As we can see, the greedy approach of making the locally best split at every node does indeed generally decrease the MSE as the tree expands, and each leaf has an associated 'SalePrice' value.
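Here is an illustrative sketch (not the original post's code) of how such a shallow tree could be fitted and drawn; it assumes df_train from the loading step, keeps only the numeric columns for simplicity, and uses sklearn.tree.plot_tree, which is available in recent scikit-learn versions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# numeric columns only and crude NaN filling, purely for this illustration
X = df_train.drop('SalePrice', axis=1).select_dtypes(include=[np.number]).fillna(0)
y = np.log(df_train['SalePrice'])   # the tree described above predicts log(SalePrice)

tree = DecisionTreeRegressor(max_depth=3)   # limit the depth so the drawing stays readable
tree.fit(X, y)

plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=list(X.columns), filled=True)
plt.show()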

Bias-Variance Tradeoff
So let's think back to our objective in supervised learning. On the one hand, we would like our model to capture the relationships between the independent variables and the dependent variable as it gets fitted to the training data, so that it can then make accurate predictions. However, the data for which the model will have to predict the dependent variable will necessarily be different from the data it was trained on; in our case, it is the Kaggle test set. Therefore, we would like our model to capture the general relationships between the independent variables and the dependent variable, so that it can generalize to unseen data and predict well. This tension is sometimes known as the 'bias-variance tradeoff'.

If our model doesn't learn enough from the training set, it will have high bias (often called 'underfitting'), meaning it did not capture all the information available in the training set, and therefore its predictions would not be as good. However, if our model learns the training data too well, it will capture the specific relationships between the independent variables and the dependent variable in the training set instead of the general ones; it will have high variance (often called 'overfitting'), and therefore it will generalize poorly to unseen data and, again, its predictions would not be as good. Clearly, we must seek a balance between the model's bias and variance.

Decision Tree Overfit
Imagine we fit a Regression Tree to our training set. What will the tree look like? As you probably guessed, it will keep splitting until there is only a single observation in every leaf (as there are no more splits to perform at that point). In other words, the tree will build a unique path for each observation in the training set, and will give the leaf at the end of that path the dependent variable value of its associated observation.

If I were to then drop the dependent variable from my training set, and ask my tree to predict the dependent variable value for each of the observations in the training set, what would happen? As you might imagine, it would do so perfectly, achieving basically 100% accuracy and 0 MSE, as it has already learned the dependent variable value associated with each of the observations in the training set.

However, if I were to ask the tree to predict the dependent variable value for unseen observations (ones it was not trained on), it would likely perform poorly, as any unseen observation would end up getting assigned a dependent variable value from a leaf which was constructed for a single specific observation in the training set. This is an example of 'overfitting'. It is possible to fiddle with the tree's parameters in order to reduce the overfit (for example, limit the tree's max_depth), but it turns out there's a better solution!
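To see the overfit concretely, here is a small, hypothetical check (reusing the X and y from the tree sketch above; it is not part of the original post): an unrestricted tree scores almost perfectly on the data it has memorized, while cross-validation on held-out folds tells a different story.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

deep_tree = DecisionTreeRegressor()             # no max_depth: grows until every leaf is 'pure'
deep_tree.fit(X, y)
print(deep_tree.score(X, y))                    # R^2 on the training data itself: ~1.0
print(cross_val_score(deep_tree, X, y).mean())  # mean R^2 on held-out folds: noticeably lower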

The Solution: Random Forests
In ML, we often design meta-models which combine the predictions of several smaller models to generate a better final prediction. This is generally called 'ensembling'. Specifically, several decision trees are often combined in an ensemble method called 'Bootstrap Aggregating', or 'Bagging' for short. The resulting meta-model is called a 'Random Forest'.

Random Forests are simple but effective. When a Random Forest is fitted to a training set, many decision trees are constructed, just like the one described above, only each tree is fitted on a random subset of the data (a 'bootstrap sample', meaning it is drawn with replacement from the entire dataset) and can only consider a random subset of the independent variables ('features') at every split. Then, to generate a prediction for a new observation, the Random Forest simply averages the predictions of all its trees and returns that as its prediction.
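To make those two sources of randomness concrete, here is a toy sketch of the bagging idea; it is illustrative only (the function name and defaults are made up), not scikit-learn's actual implementation or this article's code:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_new, n_trees=100, seed=0):
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    rng = np.random.RandomState(seed)
    all_preds = []
    for _ in range(n_trees):
        # 1) bootstrap sample: draw rows with replacement from the training set
        idx = rng.randint(0, len(X_train), len(X_train))
        # 2) feature subsetting: consider only a random subset of features at each split
        tree = DecisionTreeRegressor(max_features='sqrt', random_state=rng.randint(1 << 30))
        tree.fit(X_train[idx], y_train[idx])
        all_preds.append(tree.predict(np.asarray(X_new)))
    # the 'forest' prediction is the average of its trees' predictions
    return np.mean(all_preds, axis=0)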

But wait, Oren! All we're doing is building many weaker trees and then taking their average. Why would this work!?

Well, the short answer is that it works very well, and you should try reading up more on Random Forests if you're interested in the statistical explanation. I'm not very good at statistics, but I'll try to give a basic explanation: the bootstrap sampling and the feature subsets are meant to make the trees as uncorrelated as possible (although they are all still based on the same dataset and feature set), allowing each tree to discover slightly different relationships in the data. This results in their average having much less variance (less overfit) than any single tree, and therefore better generalization and prediction overall.

In simpler terms, for an unseen observation each decision tree predicts the dependent variable value of the leaf the observation ends up in, meaning the value of the most similar training set observation in that specific tree's space. Since each tree is constructed differently, on different data, each tree defines similarity in a different way and predicts a different value; therefore, for a given unseen observation, the average of all the trees is basically the average of the values of many training set observations that are somehow similar to it.

One consequence of this property is that while Random Forests are very good at prediction when the test set is somewhat similar to the training set (in the same range of values), which is usually the case, they are terrible at prediction when the test set differs from the training set in some fundamental way (a different range of values), as in Time Series problems for example (where the training set is from one time period and the test set is from another).
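This limitation is easy to demonstrate with a toy example (not from the article): train a forest on a simple increasing target and ask it to predict far outside the training range.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.arange(100).reshape(-1, 1)
y_train = np.arange(100, dtype=float)        # the target is simply x itself

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print(rf.predict([[50], [500]]))             # roughly 50 for x=50, but close to 99 (not 500) for x=500,
                                             # because every leaf can only return an average of training targets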

Since in our case the test set and the training set have the same range of values, we should be good to go!

Back to the Competition
Mandatory Preprocessing
One last thing is left to do before we get our Random Forest rolling. Unfortunately, while Random Forests are theoretically capable of dealing with both categorical features (non-numerical features, i.e. strings) and missing data, the scikit-learn implementation does not support either. For now, we will fill in the missing values using DataFrame.interpolate(), and then convert categorical features to numerical ones using pd.get_dummies(), which uses a scheme called 'One-Hot Encoding'. The idea is simple: imagine a categorical variable with n possible values. That column gets split into n separate columns, each corresponding to one of the original values (essentially equivalent to an 'is_value?' column for each of the original values). Each observation, which previously had a string value for the categorical variable, now has a 1 in the column corresponding to its old string value and a 0 in all the rest (this is the 'One-Hot').
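As a tiny, made-up illustration of what pd.get_dummies() does (the 'Street' column and its two values do appear in this dataset, but this three-row frame is invented):

import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'], 'LotArea': [8000, 9600, 11000]})

# numeric columns pass through untouched; the categorical 'Street' column is replaced by
# one indicator column per value: Street_Grvl and Street_Pave
print(pd.get_dummies(toy))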

We are now ready to construct a model, fit it to the training data, use it to predict on the test set, and submit the predictions to Kaggle!

Putting it all together and submitting the results
This is all the code needed to submit our model's predictions to Kaggle: about 20 lines! I ran this code and then went ahead and submitted the results to Kaggle. The score was 0.14978, which is currently around the 63rd percentile. Not bad at all for 5 minutes of coding! We can see the power of the Random Forest at work here.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
PATH = "Oren/Kaggle/Housing Prices/"  #where you put the files
df_train = pd.read_csv(f'{PATH}train.csv', index_col='Id')
df_test = pd.read_csv(f'{PATH}test.csv', index_col='Id')
target = df_train['SalePrice']  #target variable
df_train = df_train.drop('SalePrice', axis=1)
df_train['training_set'] = True
df_test['training_set'] = False
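# combine the training set and the test set so the preprocessing below treats them consistently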
df_full = pd.concat([df_train, df_test])
df_full = df_full.interpolate()
df_full = pd.get_dummies(df_full)
df_train = df_full[df_full['training_set']==True]
df_train = df_train.drop('training_set', axis=1)
df_test = df_full[df_full['training_set']==False]
df_test = df_test.drop('training_set', axis=1)
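# a Random Forest with 100 trees, using all available CPU cores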
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(df_train, target)
preds = rf.predict(df_test)
my_submission = pd.DataFrame({'Id': df_test.index, 'SalePrice': preds})
my_submission.to_csv(f'{PATH}submission.csv', index=False)

Explanation
After loading the training set and the test set into separate DataFrames, I saved the target variable and then dropped it from the training DataFrame (as I want to keep only the independent variables, the features, in it). Then, I added a new temporary column ('training_set') to both the training set and the test set that distinguishes between them, so that I could concatenate them (put them together in the same DataFrame) now and separate them again later. I then concatenated them, filled in missing values, and converted categorical features into numerical ones via One-Hot Encoding.

As mentioned, Random Forests (and most algorithms in general) work better when the training set and the test set have similar values, and therefore whenever I modify anything I try to modify both sets together. Otherwise, interpolate might fill in different values for the training set and the test set, and get_dummies might encode the same categorical feature in two different ways, which would result in worse performance.

I then separated the two sets again and got rid of the temporary column, created a Random Forest with 100 trees (in general, the more trees the better the result, up to a certain point, but the longer training will take) using all of my computer's CPU cores (n_jobs=-1), fitted it on my training set, used the fitted Random Forest to predict the target variable for my test set, put the results together with their respective Id in one DataFrame, and saved it to a CSV file. I then went to the competition's page on Kaggle and submitted the CSV file. Voila!

What's Next?
In the next articles, we will dive deeper into this competition and discuss visualization, a better-informed approach to preprocessing, feature engineering, model validation, and hyperparameter tuning. As homework, try to learn more about these topics, perhaps read a Kaggle Kernel or two, and see if you can improve your score and get into the top 50%, or submit your predictions for the Titanic competition.

Recommended Resources

Acknowledgements
  • I'd like to thank Udacity for its incredible Intro to ML course, which first introduced me to this amazing world.
  • I'd like to thank Rachel Thomas and Jeremy Howard of fast.ai, for teaching me 90% of what I know about Machine Learning and Deep Learning. Seriously, fast.ai is amazing, please go check it out. Apart from the interesting blog, they are currently offering top-notch MOOCs on Deep Learning and Computational Linear Algebra for free, and soon their Machine Learning MOOC will be released to the public as well (this MOOC served as the inspiration for this article). If anyone is interested specifically in the ML MOOC and cannot wait, please shoot me an email.


The article was originally published here

Source: TDS