A Data Scientist's Guide to Data Structures & Algorithms, Part 1

By Kimberly Cook |Email | Sep 1, 2018 | 24003 Views

In data science, computer science and statistics converge. As data scientists, we use statistical principles to write code such that we can effectively explore the problem at hand.

This necessitates at least a basic understanding of data structures, algorithms, and time-space complexity so that we can program more efficiently and understand the tools that we use. With larger datasets, this becomes particularly important. The way that we write our code influences the speed at which our data is analyzed and conclusions can be reached accordingly. In this post, I will describe Big O notation as a method for describing time-space complexity and briefly go over some algorithms that relate to time complexity. In a later post, I will discuss algorithms that relate to space complexity.

Big O Notation
In programming, an algorithm is a process or set of rules to be followed in order to achieve a particular goal. An algorithm is characterized by its running time (run-time), whether in terms of space or time. As data scientists, we are interested in the most efficient algorithm so that we can optimize our workflow.

In computer science, Big O notation is used to describe how 'fast' an algorithm grows, by comparing the number of operations within the algorithm. This will be explained in further detail later on but for now, let's understand all of the formal notation.

Formal Notation
Big Ω: the best-case scenario. The Big Ω of an algorithm describes how quickly an algorithm can run under the best of circumstances.
Big O: the worst-case scenario. Typically, we are most concerned with the Big O time because we are interested in how slowly a given algorithm will run, at worst. How do we essentially make the 'worst-case' not as bad as it could be?
Big θ: this can only be used to describe the run-time of an algorithm if the Big Ω and the Big O are the same. That is, the algorithm's run time is the same in both the best and worst cases.
Because we are most concerned with the Big O of an algorithm, the rest of this post will only focus on Big O.

How do we use Big O to describe an algorithm? Suppose you wish to search for someone's name in a phone book.

Phone books: before Google was a thing

What's the most straightforward way of finding this person? Well, you could go through every single name in the phone book until you find your target. This is known as a simple search.

If the phone book is very small, with only 10 names, this is a fairly fast process. But what if there are 1,000 names in the phone book?

At best, your target's name is at the front of the list and you only need to need to check the first item. At worst, your target's name is at the very end of the phone book and you will need to have searched all 1000 names. As the 'dataset' (or the phone book) increases in size, the maximum time it takes to run a simple search also linearly increases.

In this case, our algorithm is a simple search. Big O notation allows us to describe what our worst case is. Our worst case is that we will have to search through all elements (n) in the phone book. We can describe the run-time as:

O(n) where n: number of operations
Because the maximum number of operations is equal to the maximum number of elements in our phone book (you might need to search through them all to find your target's name), we say the Big O of a simple search is O(n). A simple search will never be slower than O(n) time.

Different Big O Run Times
Different algorithms have different run-times. That is, algorithms grow at different rates. The most common Big O run-times, from fastest to slowest, are:

  • O(log n): aka log time
  • O(n): aka linear time
  • O(n log n)
  • O(n²)
  • O(n!)

The Big O cheatsheet is also very useful for a quick graphical representation of the different run times and how they compare to each other.

A graphical representation of common Big O run times (credit: http://bigocheatsheet.com/)
In this post and its following post, I will describe common algorithms which are described by these different run-times.

Here are some principles that are important to understand before discussing some of the common algorithms.

Recursion: Recursion is when a function calls itself. Perhaps the quintessential example of recursion is in implementation of a factorial function:

def factorial(n):
    if n < 1:        #base case
        return 1
    else:             #recursive case
        return n * factorial(n-1)

The function is called within the function itself and will continue calling itself until the base case (in this case, when n is 1) is reached.

Divide and Conquer (D&C): A recursive approach for problem-solving, D&C (1) determines the simplest case for the problem (AKA the base case) and (2) reduces the problem until it is now the base case.

General overview of divide & conquer technique (credit: http://bigdata.ices.utexas.edu/project/divide-conquer-methods-for-big-data-analytics/)

That is, a complex problem is broken down into simpler sub-problems. These sub-problems are solved and their solutions are then combined to solve the original, larger problem.

Common Algorithms
Search and sort algorithms are perhaps the most important algorithms to first understand.

Simple Search
This was described earlier with the phone book example, where the worst case would require that you search through all the names in the phone book before you find the name of interest. In general, simple search has a O(n) time. The maximum time required is linearly related to the number of elements in your list.

Binary Search
Let's stick with the phone book example. We're still interested in finding someone's name in the phone book, only this time we're going to try to be more efficient about it. Instead of tediously going through each and every name in the phone book, we're going to start in the middle of the phone book and go from there.

Say our target's name begins with an P. We open to the Ms which is roughly in the middle of the alphabet. We know M is earlier than P in the alphabet, so we can eliminate the section from A to M. Now we can look at the later half of the phone book (N to Z), split that section in the middle (to the Ts), and compare to our target. T is later in the alphabet than P. We then know to eliminate the later half (T to Z). We focus on N to S now, dividing this in half and so on until we find our name of interest.

Generally, in binary search, you take your sorted (this is important) data and find the midpoint. Each time, you compare your target to the middle value. If the target value is the same as the middle value, then your job is done. Otherwise, you know which half of the list to eliminate based on the comparison. You continue dividing until the target is found or the dataset can no longer be halved.

Diagram of binary search with a list of numbers
Because binary search involves the halving of your dataset, the Big O time is O(log n). As such, it is faster than simple search, especially as your dataset grows (the algorithm's growth is not linear but logarithmic so it grows slower, relative to a linear run-time of O(n)).

As an aside, binary search can be written recursively but is not considered a D&C algorithm. Although the larger input is indeed broken down into subsets, these subsets are ignored if they do not contain the value of interest. Solutions are not produced for these subsets so that they can be combined to solve the larger input.

Selection Sort
Much like simple search for search algorithms, selection sort is perhaps the most straightforward, 'brute force' way to sort your data. Essentially, you go through every element in your list and append each element to a new list in your desired order. For example, if you are interested in sorting a list of numbers from greatest to smallest, you would:

  1. Search through the list to find the largest number
  2. Add that number to a new list
  3. Go to the original list, search through it again to find the next largest number
  4. Add that number to the new list and so on...

For selection sort, you have to go through each item in the list (this takes n times, just as it would for a simple search) and you have to do this n times (not just once, because you have to keep going back to the original list to find the next item you want to add to the new list). Thus, this takes O(n²) time.

How would quicksort differ from selection sort? If we work with a list of numbers, just as before:

  1. Pick an element from your list, known as the pivot. The selection of a pivot is important in determining how quickly a quicksort algorithm will run. For now, we can select the last element each time as the pivot. (For additional information on pivot selection, I recommend the Stanford Coursera algorithms course.)
  2. Partition the list so that all numbers smaller than the pivot are to its left and all numbers greater than the pivot are to its right.
  3. For each 'half' of the list, you can treat it as a new list with a new pivot and rearrange each half until it is sorted.

Diagram of quicksort with a list of numbers
Quicksort is an example of a D&C algorithm because it divides the original list into smaller and smaller lists which are ordered. These smaller, ordered lists are then combined to result in a larger, ordered list.

Quicksort is unique because its speed is dependent on the pivot selection. At worst, it can take O(n²) time, which is as slow as selection sort. However, if the pivot is always some random element in the list, quicksort runs in O(n log n) time on average.

Assume we are still working with our list of numbers. For the merge sort algorithm, the list would be broken down into its individual elements. Ordered pairs are then created from these elements (with the smaller number to the left). These ordered pairs are then grouped into ordered groups of four and this continues until the final merged, sorted list is created.

Animated overview of mergesort algorithm (credit: By Swfung8 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=14961648)
As with quicksort, mergesort is a D&C algorithm because the input list is broken down and sorted, before being combined to produce an ordered version of the larger, original list.

Mergesort runs on O(n log n) time because the entire list is halved (O(log n)) and this is done for n items.

Knowledge of algorithms and data structures is useful for data scientists because our solutions are inevitably written in code. As such, it is important to understand the structure of our data and how to think in terms of algorithms. In a future post, I will describe common data structures, space complexity, and common related algorithms.

Additional Helpful Resources:

The article was originally published here

Source: HOB