Machine learning algorithms face two main constraints: memory and processing speed.
Let's talk about memory first, which is usually the most limiting constraint. A modern PC typically has something like 16 GB of RAM. Consequently, it can load datasets of up to a few GB into memory, which means millions of data points, or even a billion if each point takes only a handful of bytes. For many machine learning tasks, this is more than enough.
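To make that estimate concrete, here is a rough back-of-the-envelope sketch. The 4 GB budget, the 4-byte (32-bit float) values and the helper name `rows_that_fit` are assumptions chosen for illustration, not figures from a particular system.

```python
def rows_that_fit(budget_bytes, n_features, bytes_per_value=4):
    """Number of dense data points that fit in a given memory budget."""
    return budget_bytes // (n_features * bytes_per_value)

GB = 10**9
budget = 4 * GB  # assumed: "a few GB" available for data on a 16 GB machine

# More features per data point means fewer data points fit in memory.
for n_features in (1, 10, 100, 1000):
    print(f"{n_features:>5} features -> {rows_that_fit(budget, n_features):,} data points")
```

Under these assumptions, the count ranges from about a billion data points with a single feature down to about a million with a thousand features each.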
To go beyond that, you can use techniques like stochastic gradient descent and update the machine learning model in mini-batches. That is, you load a small portion of the dataset into memory, update the model, then load the next portion, and so on. The constraint is then no longer main memory but disk storage. Computers typically have hard disks of a few hundred GB up to a couple of TB.
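The mini-batch idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a production recipe: the model (a one-variable linear regression), the synthetic CSV file, and the helper names `read_batches`, `sgd_step`, `train` and `make_demo_file` are all assumptions made up for the example. The point is that only one batch is ever held in memory, however large the file is.

```python
import csv
import itertools
import os
import random
import tempfile

def read_batches(path, batch_size):
    """Yield lists of (x, y) pairs, keeping only one batch in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        while True:
            batch = [(float(x), float(y))
                     for x, y in itertools.islice(reader, batch_size)]
            if not batch:
                return
            yield batch

def sgd_step(w, b, batch, lr):
    """One gradient step on mean squared error for the model y = w*x + b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b

def train(path, batch_size=100, lr=0.1, epochs=20):
    """Stream the file from disk once per epoch, updating after every batch."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in read_batches(path, batch_size):
            w, b = sgd_step(w, b, batch, lr)
    return w, b

def make_demo_file(n_rows=5000, seed=0):
    """Write synthetic data following y = 3x + 1 plus small noise to a temp CSV."""
    rng = random.Random(seed)
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            x = rng.uniform(-1, 1)
            writer.writerow([x, 3 * x + 1 + rng.gauss(0, 0.01)])
    return path

path = make_demo_file()
w, b = train(path)
os.remove(path)
print(f"w = {w:.2f}, b = {b:.2f}")  # should land close to the true values 3 and 1
```

Because `read_batches` is a generator, the same training loop would work unchanged if the batches came from a much larger file, or from any other source that yields one batch at a time.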
You could go even further by streaming data over the network and processing it in mini-batches, which means there is essentially no limit to the amount of data you can handle. However, processing speed will have taken over as the main constraint long before you get that far. You can speed things up considerably by moving computation from the CPU to a GPU, but with terabytes of data, training a machine learning model will be extremely time-consuming even on a state-of-the-art GPU.
Up until this point, the machine learning algorithms have remained pretty much the same. But once a single computer can no longer handle the data volume, the only option is to distribute the workload across multiple servers. A few computational tricks are needed to parallelize the algorithms, but they are still essentially the same algorithms. The parallelization is handled by big data frameworks, and the machine learning engineer only needs to make the right API calls.
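One common such trick is data parallelism: each worker computes a gradient on its own shard of the data, and a driver combines the partial results into a single update. The sketch below only imitates this on one machine, with threads standing in for servers; it is not how any particular framework is implemented, and the names `shard_gradient` and `distributed_step` are hypothetical. The model is the same toy linear regression y = w*x + b on synthetic data.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(params, shard):
    """Summed squared-error gradient for y = w*x + b over one worker's shard."""
    w, b = params
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard)
    grad_b = sum(2 * (w * x + b - y) for x, y in shard)
    return grad_w, grad_b, len(shard)

def distributed_step(params, shards, pool, lr):
    """Workers compute partial gradients; the driver averages them and updates."""
    results = list(pool.map(lambda shard: shard_gradient(params, shard), shards))
    n = sum(count for _, _, count in results)
    grad_w = sum(gw for gw, _, _ in results) / n
    grad_b = sum(gb for _, gb, _ in results) / n
    w, b = params
    return w - lr * grad_w, b - lr * grad_b

# Synthetic data following y = 3x + 1, split round-robin across 4 "servers".
rng = random.Random(0)
data = [(x, 3 * x + 1) for x in (rng.uniform(-1, 1) for _ in range(4000))]
n_workers = 4
shards = [data[i::n_workers] for i in range(n_workers)]

params = (0.0, 0.0)
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    for _ in range(200):
        params = distributed_step(params, shards, pool, lr=0.3)
print(params)  # approaches the true parameters (3, 1)
```

The update rule is identical to ordinary gradient descent; only the place where the gradient is computed has changed, which is why the algorithm itself stays essentially the same.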
Originally appeared on Quora.