To discover interesting transactional patterns of different customers and countries
In general explanation, data science is nothing more than using advanced statistical and machine learning techniques to solve various problems using data and data analysis. Yet, it's easier to just dive into applying some fancy machine learning algorithms -and Voila! You got the prediction - without first understanding the data.
EDA is so important for 3 reasons (at least) as stated below:
Make sure business stakeholders ask the right questions - often by exploring and visualizing data - and validate their business assumptions with a thorough investigation
Spot any potential anomalies in data to avoid feeding wrong data to a machine learning model
Interpret the model output and test its assumptions
There you have it. Now that we have already understood the "WHAT and WHY" aspects of EDA, let's examine a dataset together and go through the "HOW" that will eventually lead us to discover some interesting patterns, as we'll see in the next section.
We'll focus on the overall workflow of EDA, visualization and its results. For technical reference, please refer to my notebook on Kaggle anytime you want to have a more detailed understanding of the codes.
To give a brief overview, this post is dedicated to 5 sections as follow:
Context of Data
Data Cleaning (a.k.a data preprocessing)
Exploratory Data Analysis
Let's get started and have fun!
Context of Data
In this post, we'll investigate the E-Commerce dataset obtained from Kaggle. Before dealing with the dataset, let's try to understand what it is about to give us a better understanding of its context.
In short, the dataset consists of transactional data with customers in different countries who make purchases from an online retail company based in the United Kingdom (UK) that sells unique all-occasion gifts. The information is summarized as below:
Company - UK-based and registered non-store online retail
Products for selling - Mainly all-occasion gifts
Customers - Most are wholesalers (local or international)
Transactions Period - 1st Dec 2010‚??9th Dec 2011 (One year)
We all know data in the real world is messy (including Kaggle!) and thus, let's spend some time to clean the data to the format we need. Below is a snapshot of what the original data looks like after loading the dataset into a data frame.
As intuitive as the variables (column names) may sound, let's take a step further by understanding what each variable means:
InvoiceNo (invoice_num): A number assigned to each transaction
StockCode (stock_code): Product code
Description (description): Product name
Quantity (quantity): Number of products purchased for each transaction
InvoiceDate (invoice_date): Timestamp for each transaction
UnitPrice (unit_price): Product price per unit
CustomerID (cust_id): Unique identifier each customer
Country (country): Country name
NOTES: Product price per unit is assumed to follow the same currency throughout our analysis
So far, so good. We see that there are some missing values for Customers ID and Description. The rows with any of these missing values will, therefore, be removed.
By understanding the data in a more descriptive manner, we notice two things:
Quantity has negative values
Unit Price has zero values (FREE items?)
At this stage, we'll just remove Quantity with negative values - this notebook explains what negative values mean - and Unit Price with zero values will be explained in the later part.
To calculate the total money spent on each purchase, we simply multiply Quantity with Unit Price:
amount_spent = quantity * unit_price
Finally, we add a few columns that consist of the Year_Month, Month, Day and Hour for each transaction for analysis later. The final data frame will look like this:
Exploratory Data Analysis
The highest number of orders and money spent on purchases
In the E-Commerce world, we often want to know which customers - where they come from-place the most orders and spend the most money as they drive the sales of companies.
From the results, we observe that most orders are made in the UK and customers from the Netherlands spend the highest amount of money on their purchases.
How many orders (per month)?
Overall, we consider that the company receives the highest number of orders in November 2011 since we do not have the full month of data for December 2011.
How many orders (per day)?
Surprisingly, there are no transactions on Saturday throughout the whole period ( 1st Dec 2010‚??9th Dec 2011). Reasons behind are left for discussion as the dataset and its context are limited.
We also spot a trend where the number of orders received by the company tends to increases from Monday to Thursday and decrease afterward.
How many orders (per hour)?
In terms of hours, there are no transactions after 8:00 pm until the next day at 6:00 am.
Besides, we notice that the company receives the highest number of orders at 12:00 pm. One of the reasons could be due to the fact that most customers make purchases during lunch hour between 12:00 pm - 2:00 pm.
Discover transactional patterns for Unit Price
Before we move our attention to the zero values (FREE items) of unit price, we make a boxplot to check the distribution of the unit price for all products.
We observe that 75% of the data has a unit price of fewer than 3.75 dollars - which indicates most products are relatively cheap. Only minority of them has high prices per unit (Again, we assume each price per unit follows the same currency).
Well‚?¶ FREE items for purchase? YES, maybe‚?¶
From the plot, the company tends to give out FREE items for purchases occasionally each month (except June 2011).
However, it is not clear what factors contribute to giving out the FREE items to the particular customers. The more in-depth analysis could be done for further explanation. Let me know if you have found out the reasons behind!
Discover transactional patterns for each Country
Top 5 countries with the most number of orders
As expected, the company receives the highest number of orders in the UK (since it is a UK based company).
To better discern the trend, UK is removed for clearer comparison among other countries. As a result, the TOP 5 countries (including the UK) that place the highest number of orders are as below:
Top 5 countries with the highest money spent
As the company receives the highest number of orders from customers in the UK, it is natural to see that customers in the UK spend the most on their purchases.
Same as before, UK is removed for clearer comparison among other countries. The TOP 5 countries (including the UK) that spend the most money on purchases are as below:
Results from EDA
The customer with the highest number of orders comes from the United Kingdom (UK)
The customer with the highest money spent on purchases comes from the Netherlands
The company receives the highest number of orders from customers in the UK (since it is a UK-based company). Therefore, the TOP 5 countries (including the UK) that place the highest number of orders are as follow - United Kingdom, Germany, France, Ireland (EIRE), Spain
As the company receives the highest number of orders from customers in the UK (since it is a UK-based company), customers in the UK spend the most on their purchases. Therefore, the TOP 5 countries (including the UK) that spend the most money on purchases are as follow ‚?? United Kingdom, Netherlands, Ireland (EIRE), Germany, France
November 2011 has the highest sales. The month with the lowest sales is undetermined as the dataset consists of transactions until 9th December 2011 in December
There are no transactions on Saturday between 1st Dec 2010 - 9th Dec 2011
The number of orders received by the company tends to increases from Monday to Thursday and decrease afterward
The company receives the highest number of orders at 12:00 pm. Possibly most customers made purchases during lunch hour between 12:00 pm - 2:00 pm
The company tends to give out FREE items for purchases occasionally each month (Except June 2011). However, it is not clear what factors contribute to giving out the FREE items to the particular customers
Simply by performing EDA on the dataset, we've identified some interesting results. Of course, the results don't just stop here. They can always be used to validate business assumptions (if any) and interpret a machine learning model's output and so much more!
Remember. Creativity is your limit when doing EDA. And it really depends on your business understanding, curiosity to ask interesting questions to challenge and validate assumptions, as well as your intuition.
Thank you for reading. Hopefully, by showing the overall workflow of EDA, visualization and its results, EDA will become less intimidating to you and you'll be more interested in getting your hands dirty next time.
If you find this article useful feel free to share it with others or recommend this article!