In AI, is Data more Important than Algorithms?
There is no clear-cut solution, and we can't expect a black-or-white answer to this question. Whether data or algorithms are more important has been debated at length by experts (and non-experts) in the last few years, and the TL;DR is that it depends on many details and nuances that take some time to understand.
I recommend starting with the answer to "In machine learning, is more data always better than better algorithms?", which could address maybe 80% of this question, and then coming back here. There are some slight but important differences in this question that I will address below.
First, the question I linked to refers to machine learning, while this one is about artificial intelligence. Is that the same thing? Well, not exactly. ML is a subfield of AI in which you specifically need data to train algorithms. AI also includes other approaches that are based on logic or rules and don't require data in the same way or quantity as ML. In other words, if we agree that data is not always more important than algorithms in ML, that should be even less the case for the broader field of AI.
That said (see "What does the market think artificial intelligence means as compared to machine learning?"), most people might not care much about the difference between ML and AI and will use the terms interchangeably. In fact, many people today use AI as a synonym for Deep Learning, which is itself a particular kind of machine learning approach. So I think it would be good to address this question from the particular viewpoint of recent advances in Deep Learning:
In modern Deep Learning approaches, is data more important than algorithms?
Well, again, yes and no. It is true that these approaches are very "data-hungry". Without going into many details, deep learning models have many parameters that need to be learned, and therefore need a lot of data to generalize reasonably well. In that sense, having a lot of data is key to building good training sets for those approaches.
In fact, some have argued that there is a direct relation between the appearance of large public datasets like ImageNet and recent research advances. Note, though, that this highlights that, at least in some domains, the existence of public datasets makes data less of a competitive advantage.
Also, the interesting thing about some of those algorithms and approaches is that they can sometimes be "pre-trained" by whoever owns the dataset and then applied by many users. In these cases, data tends to be less of a need. An easy way to understand this is the following: if you have to train a model to translate from English to Spanish, all you need to do is gather a huge dataset and train the model once. The model itself carries all the information, so anyone who can get hold of it does not really need the original data anymore. For example, the famous 22-layer GoogLeNet model is available for download in different formats (e.g. as a Keras model).
So, again, even for these data-hungry applications, it is not always the case that you need huge amounts of data to take advantage of the latest advances. That said, if you are trying to push the state of the art and come up with very concrete applications, then yes, you will need internal data that you can use to train your cool new deep learning approach.
Andrew Ng often mentions that in deep learning, more data + larger models = better performance. Also, different algorithms can give better results once you have a lot of data.
Let's look more closely at the most important situations you'll find yourself in when working on a machine learning problem:
1. If your training error is close to 0 and there is a gap between your training error and test error, you are probably overfitting and/or need more data.
Solution: This is the typical case where you need more data, and more data will usually reduce the overfitting. You can also try to reduce overfitting with regularization, sub-sampling, a less complex model, or dropout (depending on your algorithm).
2. If your training error is close to your test error and both are some distance from 0, you are in the typical situation of having too few features, too little complexity in your model, and/or too little data.
Solution: You can try to add more features, although sometimes you simply can't. In some cases you'll also have to increase the complexity of your model a bit: simpler models generalize too coarsely and underfit, while more complex models capture details better but tend to overfit. You'll have to find the sweet spot yourself. More data may help as well, particularly once the model has enough capacity to make use of it.
3. A combination of a large training error and a gap between the training error and test error typically calls for more data.
Solution: You can try other algorithms, but when this happens you usually have less data than the minimum the algorithm requires.
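The three situations above can be sketched as a small decision rule. This is only an illustration: the function name `diagnose` and the thresholds `high_error` and `large_gap` are assumptions made up for this example, and sensible values depend entirely on your problem and error metric.

```python
def diagnose(train_error, test_error, high_error=0.10, large_gap=0.05):
    """Map a (train_error, test_error) pair to one of the three cases above.

    high_error and large_gap are illustrative thresholds, not standard values.
    """
    gap = test_error - train_error
    far_from_zero = train_error > high_error   # training error is "large"
    big_gap = gap > large_gap                  # test error well above training error

    if far_from_zero and big_gap:
        return "case 3: large training error and a gap; likely too little data"
    if big_gap:
        return "case 1: overfitting; add data, regularize, or simplify the model"
    if far_from_zero:
        return "case 2: underfitting; add features or model complexity"
    return "ok: both errors are low and close together"


if __name__ == "__main__":
    print(diagnose(0.01, 0.20))  # low training error, large gap
    print(diagnose(0.30, 0.32))  # both errors high and close together
    print(diagnose(0.30, 0.50))  # high training error and a large gap
```

In practice you would look at these errors over several training set sizes rather than at a single pair, since one measurement can be noisy.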
However, the learning curve is never perfectly smooth: the error may get slightly worse or slightly better each time the sample set grows. Also, your model will typically have to grow in complexity along with the new data. Monitor your loss, and when it starts increasing, add some complexity to the model to keep up with the new data.
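To make the idea of watching errors as the sample set grows concrete, here is a minimal sketch that computes training and test error for increasing training set sizes. Everything in it is an assumption chosen for illustration: the synthetic data (a noisy line), the plain least-squares model, and the helper names `make_data`, `fit_line`, `mse` and `learning_curve`. The model's complexity is fixed here, so it only shows the "more data" side of the story.

```python
import random


def make_data(rng, n, noise=1.0):
    """Synthetic data: y = 2x + 1 plus Gaussian noise (an assumed toy problem)."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(sizes, seed=0, test_size=2000):
    """Return (n, train_error, test_error) for each training set size n >= 2."""
    rng = random.Random(seed)
    test_x, test_y = make_data(rng, test_size)
    train_x, train_y = make_data(rng, max(sizes))
    curve = []
    for n in sizes:
        a, b = fit_line(train_x[:n], train_y[:n])
        curve.append((n, mse(a, b, train_x[:n], train_y[:n]),
                      mse(a, b, test_x, test_y)))
    return curve


if __name__ == "__main__":
    for n, train_err, test_err in learning_curve([5, 20, 100, 500, 2000]):
        print(f"n={n:5d}  train={train_err:.3f}  test={test_err:.3f}")
```

With enough data, both errors should settle near the noise level of the data (about 1.0 here), and the gap between them should shrink, though the individual steps along the way can move in either direction.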