It has been a while since I last posted on Medium. Having been in Data Science for almost half a year, I've made a lot of mistakes and learned from them along the way... the hard way.
There's no failure, only feedback.
And the real world is a feedback mechanism.
And YES you're right, the learning journey wasn't easy. Just keep grinding. LEARN and IMPROVE.
Through my learning experience, I have finally come to realize that there are a few common pitfalls that most beginners in Data Science (like me) will probably encounter. If you have hit them too, I hope the 5 biggest lessons I learned from these pitfalls will guide you through your journey. Let's get started!
1. Business Domain Knowledge
To be honest, this lesson hit me hard right in my face when I first started as I did not put much emphasis on the importance of domain knowledge. Instead, I spent too much time on improving my technical knowledge (building a sophisticated model without really understanding the business needs).
Without understanding the business thoroughly, chances are your model would not add any value to the company, as it simply doesn't serve the purpose, regardless of how accurate it is.
The most common technique to improve model accuracy is Grid Search, which searches for the best parameters for the model. However, only by understanding the business needs and adding relevant features to train your model can you boost its performance significantly. Feature engineering is still very important, and Grid Search is just the final touch to improve your model.
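To make this concrete, here is a minimal sketch of Grid Search with scikit-learn. The synthetic dataset and the small parameter grid are illustrative assumptions standing in for real business data, not anything from a real project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy dataset standing in for real business data (purely for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grid Search tunes hyperparameters via cross-validation, but it only
# squeezes out the last few percent; relevant, business-informed features
# usually matter far more than the tuning step.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", round(search.best_estimator_.score(X_test, y_test), 3))
```

In practice, the bigger wins usually come from crafting features that encode business knowledge before you ever reach this tuning step.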
As always, be genuinely interested in your company's business, because your job is to help them solve their problems through DATA. Ask yourself if you're really passionate about what they're doing and show your empathy at work.
Always know what you're talking about
Understanding the business itself is not sufficient; you also need to articulate your ideas and present them to colleagues and stakeholders in terms they can understand in the business context.
In other words, NEVER use strange (or self-defined) words that stakeholders are not familiar with, as this would only give rise to misunderstanding between you and them.
Even if your findings are correct and your insights impactful, your credibility would be questioned and your findings would be nothing but a debatable subject.
Before you show how data can be used to solve business problems, I'd suggest first showing that you understand the business as a whole (including technical terms commonly used in your day-to-day work), and subsequently identifying a problem statement that can be answered with the available data.
2. Detail-Oriented Mindset and Workflow
Be like a Detective. Carry out your investigation with laser focus on details. This is particularly important during the process of data cleaning and transformation. Data in real life is messy and you must have the capability to pick up signals from the ocean of noise before you get overwhelmed.
Therefore, having a detail-oriented mindset and workflow is of paramount importance to succeed in Data Science. Without a meticulous mindset or a well-structured workflow, you might lose your direction in the midst of exploring your data.
You may be diligently performing Exploratory Data Analysis (EDA) for some time and still not have reached any insights. Or you may be repeatedly training your model with different parameters, hoping to see some improvement. Or perhaps you may be celebrating the completion of an arduous data cleaning process, when the data is in fact not clean enough to feed to your model. I had been through this aimless process, only to realize that I did not have a well-structured workflow and was simply hoping for the best.
Hoping for the best left me with no control over what I was doing. The system was disordered, and I knew something was wrong.
I stepped back to look at the bigger picture of what I'd been doing; I reorganized my thoughts and workflow, trying to make everything standardized and systematic. And it worked!
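One way to make a workflow standardized and systematic is to start every project with the same repeatable data-quality check. The helper below is a hypothetical sketch using pandas, not my actual workflow; adapt the checks to your own data:

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, missing count, distinct values.

    A hypothetical first-pass check to run before any cleaning or modeling,
    so surprises surface early instead of mid-analysis.
    """
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

# Example: a messy toy dataset with missing values in both columns.
df = pd.DataFrame({"age": [25, None, 40], "city": ["KL", "KL", None]})
print(data_quality_report(df))
```

Running the same report at every stage (raw, cleaned, transformed) turns "I hope the data is clean" into a check you can actually verify.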
3. Design and Logic of Experiment
A systematic workflow gives you a macroscopic view of your whole data science prototyping system (from data cleaning to interpreting model results); an experiment is an integral part of that workflow, encompassing your logic for hypothesis testing as well as the model-building process.
Typical machine learning problems (Kaggle competitions, etc.) are straightforward, as you can just take the training data and start building your model.
However, things get complicated in the real world when it comes to framing your logic, designing an experiment to test your assumptions, and evaluating your model with suitable success metrics.
At the end of an experiment, every claim or conclusion should always be supported by facts and data. NEVER conclude something without verifying its validity.
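As a sketch of backing a claim with data, a simple two-sample t-test can check whether an observed difference between two groups is likely just noise. The scenario and numbers below are simulated assumptions purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical A/B experiment: did a new feature change session length?
control = rng.normal(loc=10.0, scale=2.0, size=200)
treatment = rng.normal(loc=10.8, scale=2.0, size=200)

# Two-sample t-test: is the observed difference between group means
# larger than what random chance would plausibly explain?
t_stat, p_value = stats.ttest_ind(treatment, control)
alpha = 0.05
print(f"mean difference = {treatment.mean() - control.mean():.2f}")
print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The point isn't this particular test; it's that "the new feature helped" stops being an opinion once the claim is tied to an explicit hypothesis, a metric, and a significance threshold agreed on before the experiment.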
4. Communication Skills
If there's only one takeaway from this post, I hope you can always strive to improve your communication skills. It doesn't matter if you are a beginner, intermediate or an expert in Data Science.
Promise me one thing: that you'd share your thoughts with others while attentively listening to their opinions at the same time. Be receptive to criticism and feedback.
Speak the language of the business and communicate with colleagues, managers and other stakeholders in terms that they understand. This resonates with the first lesson, Business Domain Knowledge. Failing to grasp the language of the business would render your communication with team members less effective, as people may have a hard time understanding your words from their point of view.
As a result, time gets wasted; people get frustrated; your credibility and relationship with them would likely get affected. What a lose-lose situation!
Even worse, a lack of communication skills would leave business stakeholders struggling to understand your analysis results. Always communicate your ideas, approach, results and insights in a simple manner despite the complexity behind them. Simply put, if you speak a business language to business people, they feel more comfortable and empowered, and they are much more willing to invest their time in the process, leading to more active participation in the conversation to understand your analysis. This also leads to the importance of the last lesson, Storytelling.
5. Storytelling
If it wasn't obvious by now, Data Science isn't just about crunching data and building models to showcase results to stakeholders. Even with a model whose stellar performance meets the business needs, your end goal should be to deliver your results to stakeholders through compelling data storytelling that can answer some of the following questions (depending on your project goals):
- Why do we have to analyze it?
- What insights can we obtain from the results?
- What decisions/action plans can we make out of it?
Imagine you are the stakeholder: what makes storytelling compelling and convincing?
Let's sit back and relax. Now imagine a data scientist showing you a highly accurate model prediction for a business problem, without further explanation. You might think: Impressive! The model is doing a great job... So what is next? And then?
Do you get what I'm trying to portray here? There is a definite gap between model results and action plans. Stakeholders wouldn't know what to do even if you showed them a highly accurate model prediction. We have to bridge the gap by thinking from their perspective, answering their questions and concerns rather than solely meeting the business objectives, so that the results ultimately lead to action plans.
There are many ways of bridging the gap and I'll briefly highlight two approaches that can provide illuminating insights and guide stakeholders to their action plans.
Set a benchmark for comparison
It is insufficient to claim that a model's performance is good without having something to compare it with. In other words, a benchmark is needed as the baseline, so that we know whether the model does a great job or the other way round.
Without this benchmark, it is practically meaningless to claim that a model performs well, as questions are left unanswered: How good is good enough? Why should I believe your results?
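A quick way to establish such a baseline is a "dumb" model that any real model must beat. Here is a hedged sketch using scikit-learn's DummyClassifier, which always predicts the most frequent class; the dataset is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real business dataset.
X, y = make_classification(n_samples=400, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Baseline: always predict the most frequent class, ignoring the features.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", round(baseline.score(X_test, y_test), 3))
print("model accuracy:   ", round(model.score(X_test, y_test), 3))
```

Reporting "92% accuracy" means very little on its own; reporting "92% against a 50% majority-class baseline" is a claim a stakeholder can actually weigh.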
This is especially important, as it will decide whether your model gets pushed into production. It means that you have to show the BEST and WORST case scenarios of the model's performance.
This is where risk management comes in, because stakeholders want to know the model's limitations: where it works and where it fails. They want to know how much risk the company has to bear when the model is pushed into production, which can ultimately affect their final action plans.
Therefore, understanding the importance of risk management will not only make your results more compelling, but also increase stakeholders' confidence in you and your outcome substantially (since you have helped the company to manage and minimize risk).
The article was originally published here