This is a very simple post aimed at sparking interest in Data Analysis. It is by no means a complete guide, nor should it be used as complete facts or truths.
I'm going to start today by explaining the concept of ETL: why it's important and how we're going to use it. ETL stands for Extract, Transform, and Load. While it sounds like a very simple concept, it's important not to lose sight of it during the analytics process and to remember what our core goals are. Our core goal in data analytics is ETL. We want to extract data from a source, transform it by cleaning it up or restructuring it so that it is more easily modeled, and finally load it in a way that lets us visualize or summarize it for our viewers. At the end of the day, the goal is to tell a story.
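As a rough illustration, the three stages can be sketched as three small Python functions. The CSV layout and field names here (a "region" and "amount" column) are invented for the example:

```python
import csv

def extract(path):
    """Extract: read raw rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean the data so it is easier to model."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # drop rows with missing or malformed amounts
        cleaned.append({"region": row["region"].strip().title(),
                        "amount": amount})
    return cleaned

def load(rows):
    """Load: summarize the cleaned data for the viewer."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals
```

The point isn't the specific code; it's that each stage has one job, and keeping them separate makes it obvious where you are in the process.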
Let's get started!
But wait, what are we trying to answer? What are we trying to solve? What can we calculate and/or show in order to tell a story? Do we have the data or the means necessary to be able to tell that story? These are important questions to answer before we get started. Usually, you're an experienced user on a certain database: you have a strong understanding of the data available to you, and you know exactly how to pull it and modify it to fit your needs. If you don't, you may need to focus on that first. The worst thing you can do, and I'm very guilty of it at times, is to get so far down the ETL trail only to realize you have no story and no real end game in mind.
Step 1: Define a Clear Goal
Map out the way you're going to succeed, and focus on every step of the process. What are we going to use to extract the data? Where are we going to extract it from? What programs am I going to use to transform the data? What am I going to do once I have all the numbers? What kind of visualizations will emphasize the results? These are all questions you should have answers to.
Step 2: Get Your Data (EXTRACT)
This sounds a lot easier than it actually is. If you're more of a beginner, it's going to be the hardest obstacle in your way. Depending on your use case, there is typically more than one way to extract data.
My personal preference is to use Python, a scripting language that is very powerful and used heavily in the analytics world. There is a Python distribution called Anaconda that already includes a lot of the tools and packages you'll want for data analytics. Once you've installed Anaconda, you'll need to download an IDE (integrated development environment), which is separate from Anaconda itself but is what you use to write and run your code. I recommend PyCharm.
Once you've downloaded everything necessary to extract data, you're going to have to actually extract it. Ultimately, you have to know what you're looking for in order to be able to search for it and figure it out. There are a number of guides out there that will walk you through the technicalities of this process in more detail. That is not my goal; my goal is to outline the steps necessary to analyze data.
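To make the extract step a little more concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table and rows are made up so the example is self-contained; in a real project you'd connect to whatever database or files you actually work with:

```python
import sqlite3

# Build a throwaway in-memory database purely for illustration;
# in practice you'd connect to your real data source instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 10.5), ("West", 7.0)],
)

# The extract itself: a query that pulls the raw rows into Python.
rows = conn.execute(
    "SELECT region, amount FROM sales ORDER BY region"
).fetchall()
```

The same pattern applies whatever the source is: open a connection, ask for the rows you need, and get them into Python structures you can transform.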
Step 3: Play With Your Data (TRANSFORM)
There are a number of programs and ways to accomplish this. Most aren't free, and the ones that are aren't very easy to use out of the box. This stage should ordinarily be one of the quicker stages of the process, but if you're doing your first analysis, it's likely going to take you the longest, especially if you switch product offerings. Let's go through the different options you have, starting with free (or close to it) and moving on to more expensive options that are less feasible if you're a complete noob.
Qlikview - there is a free version. It is essentially the full version; the only difference is that you lose some of the enterprise functionality. If you're reading this guide, you don't need that.
Microsoft Excel - I can't really promote this software enough. If you're a student you likely already own this software. If you're not, but you don't know Excel, you should consider investing because knowing Excel is usually good enough to get a job somewhere doing something.
R/Python - These are a lot more difficult to use for data manipulation. If you're capable of using these tools for this purpose, you are absolutely not reading this guide.
Depending on the particular project you're working on there are different ways to transform your data. Text analytics is far different from other forms of analytics. Each form of analytics is its own beast, and I could probably write 10 pages in-depth on each kind, the issues you run into and ways to solve them, so I will not be doing that in this particular article.
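To give one generic example of what "transforming" often means in practice, a cleanup pass might deduplicate records, normalize text fields, and cast types. The field names here ("name" and "score") are invented for illustration:

```python
def clean(records):
    """Deduplicate records, normalize names, and cast scores to int."""
    seen = set()
    out = []
    for rec in records:
        name = rec.get("name", "").strip().lower()
        if not name or name in seen:
            continue  # skip blank names and duplicates
        seen.add(name)
        out.append({"name": name, "score": int(rec.get("score", 0) or 0)})
    return out
```

Every project needs its own version of this step, but the spirit is always the same: get the data into a consistent, trustworthy shape before you try to summarize it.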
Step 4: Visualize (LOAD)
This step is essentially about displaying the data to your user. Depending on your role in the process, this can look completely different. If someone else is going to dissect the data you give them, you're likely not going to create any visualizations. However, you might create models that make it much easier for the end-user to look at the data, understand it, and manipulate it. This is, in my opinion, the most important step regardless of your role in the ETL process.
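As a simple sketch of this final step, summarized totals can be turned into a bar chart. I'm assuming matplotlib here (it ships with Anaconda), and the numbers are invented placeholders for whatever your transform step produced:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical summary numbers; in practice these come from your transform step.
totals = {"East": 12.5, "West": 7.0, "North": 9.3}

fig, ax = plt.subplots()
ax.bar(list(totals.keys()), list(totals.values()))
ax.set_title("Sales by region")
ax.set_ylabel("Total sales")
fig.savefig("sales_by_region.png")
```

A chart like this is the "story" the whole pipeline was building toward: someone who never saw the raw data can glance at it and take away the point.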