The deep learning revolution has ushered in a new generation of machine learning tools capable of identifying the patterns in massive noisy datasets with accuracy that often exceeds that of human domain experts. In turn, as machines have achieved human or even superhuman accuracy across an increasing number of tasks, we have increasingly described them in the same terms we describe ourselves, as silicon incarnations of life, learning about the world. Yet, a critical distinction between machines and humans is the way in which we reason about the world: humans through high order semantic abstractions and machines through blind adherence to statistics.
The problem with likening machine learning to human learning is that when humans learn, they connect the patterns they identify to high order semantic abstractions of the underlying objects and activities. In turn, our background knowledge and experiences give us the necessary context to reason about those patterns and identify the ones most likely to represent robust actionable knowledge. In contrast, machines blindly search for the strongest signals in a pile of data. Lacking a cushion of background knowledge or life experiences to understand the meaning of those signals, deep learning algorithms cannot distinguish between meaningful and spurious indicators. Instead, they merely blindly encode the world according to statistics, rather than semantics.
A human, shown a collection of photographs of dogs running around dog parks and house cats running around people's homes, is able to use their life experiences to understand that in this particular context, the background of the photograph is not relevant to categorizing whether the photo depicts a dog or a cat. A machine learning algorithm, on the other hand, might recognize that the strongest signal differentiating a dog from a cat is whether the photo is a bright outdoor photo or a dim indoor photo. If the machine's goal is to maximize its accuracy at distinguishing dogs and all of the dog photos are outdoors, then this is indeed the best signal for the machine to latch onto. However, in doing so, the machine will necessarily fail when shown images of large outdoor cats like lions.
This is why diversity in training data is so critical, to ensure the machine sees a wide range of counterexamples that cancel out spurious patterns. In effect, one is rearranging the room such that the blindfolded algorithm can more likely find its way to the door, rather than giving the algorithm the necessary sight to navigate on its own.
Unfortunately, as every deep learning researcher knows the hard way, not all datasets have sufficient diversity to cancel out all possible spurious signals. Devising balanced training datasets also requires the creativity to anticipate every possible pattern the machine might latch onto and manually proactively cancel them all out ahead of time with counterexamples. Though, here again, the ability of machine learning algorithms to reach deeply into a dataset's most subtle patterns means it is nearly impossible to counter every possible spurious signal a machine might find.
Today's computer vision algorithms are getting better at this by segmenting images into objects and performing recognition at the object level, but this merely addresses some of the symptoms, rather than curing the problem.
Machine learning algorithms still largely lack the ability to reason about the interrelationships of the objects in an image. Most importantly, their reliance on primitive patterns rather than semantic abstractions means they lack our own external world knowledge to perform disambiguation using information outside that image.
A machine trained on photos of outdoor dogs and indoor cats will likely fail to correctly categorize an outdoor cat like a lion because it is unable to separate objects from their contexts. In contrast, context is often critical in recognition tasks. Take the example of a photograph of the main battle tank. Seen in a dessert of sand, most humans would label it as a tank. Yet, if the camera is zoomed out to show that dessert of sand happens to be a backyard sandbox with a young child's hand holding the tank, we as humans are able to use that surrounding context to recognize that the object in question is a toy tank rather than the real thing.
While a machine today could be readily taught that difference by being shown lots of examples of children holding toy tanks, the end result would be extremely brittle compared to our own reasoning based on scale. Given enough examples, the machine could learn basic context, that patterns and textures associated with children in an image should change the label of an object from a real tank to a toy tank. However, when shown the same toy tank in a driveway in front of a house with no humans around, the machine is likely to go right back to labeling it as the real thing.
In short, the patterns learned by machines are extremely brittle surface-level observational visual characteristics of the input data, rather than the higher order abstractions that would permit them to connect those observations to external domain knowledge and properly reason about what they are seeing.
In essence, they are akin to a human shown patterns in a pile of numbers and asked to flag future occurrences without any understanding of what those numbers represent or what the decision involves. This is one of the reasons that current deep learning systems have been so brittle and easy to fool despite their uncanny power. They search for correlations in data, rather than meaning. Therein lies the great contradiction of deep learning today: machines which perform at superhuman levels much of the time, only to spectacularly fail at any moment in the most unexpected and counterintuitive ways.
This is also what makes life-and-death applications of machine learning like driverless cars so frightening, in that we have no idea what basic scenario might cause them to suddenly fail, potentially with fatal consequences.
How do we fix this?
One option is to build machine learning systems that are able to succinctly describe the patterns they learn such that a human domain expert can review and approve each pattern. Such a tag team approach would couple the extraordinary pattern recognition of machines with the domain knowledge of humans.
On the other hand, one of the reasons machine learning has been so successful compared to humans in many domains is its ability to spot the very kinds of subtle or unexpected patterns that would appear as spurious noise to a human, but which are legitimate signals that simply had been undiscovered by humans. This is especially the case in theory-driven fields like population-scale human behavior in which observational data has historically been sparse or entirely unavailable, leading to theories that miss all the signals we have subsequently found to be the most predictive. Having humans review the patterns identified by machines would be counterproductive in such situations, with those humans discarding all of the patterns that did not fit with their (typically incorrect) theories.
Putting this all together, over the past half-decade, machine learning has experienced a renaissance in which deep learning approaches have broken through many of the formerly intractable tasks like computer vision and yielded superhuman accuracy in an increasing number of domains. Yet, the extreme brittleness of those solutions and the inability to know where they might spectacularly fail to pose unique challenges as deep learning finds applications in life-and-death domains like driverless cars and medicine. The underlying problem is that today's machine learning algorithms merely blindly learn statistical patterns in data without any understanding of whether those patterns actually legitimately relate to the task at hand. Only once deep learning systems can reason about the world in terms of semantics, rather than statistics and incorporate external world knowledge and context into their decision-making processes, will we finally have machines that can begin to shed some of the brittleness and failures of the current generation of AI.