An Interview with Sergey Nikolenko, Chief Scientist of Neuromation, a Blockchain for Artificial Intelligence company.
The revolution is long overdue: deep learning employs artificial neural networks of extremely large capacity and, therefore, requires highly accurate labeling. Collecting large datasets of images, text and sound is easy, but describing and annotating data to make it usable has traditionally been challenging and costly. Crowdsourcing was applied to the problem of dataset creation and labeling a few years ago, employing large numbers of humans to correct mistakes and improve accuracy. It proved slow, expensive and introduced human bias. Besides, there were tasks that humans simply could not do well, such as estimating distances between objects, quantifying lighting in a scene, accurately translating text, and so on.
Sergey has been a researcher in the field of machine learning (deep learning, Bayesian methods, natural language processing and more) and analysis of algorithms (network algorithms, competitive analysis). He has authored more than 120 research papers, several books, and courses such as "Machine learning" and "Deep learning", and he has extensive experience with industrial projects (Neuromation, SolidOpinion, Surfingbird, Deloitte Analytics Institute).
What are the differences between what Neuromation is doing and what Ben Goertzel is doing?
As far as I could understand from open information, SingularityNET aims to create a decentralized marketplace for "AI models". SingularityNET emphasizes the interaction between different AI models, whereas we emphasize training and data generation for the models. Moreover, SingularityNET makes no attempt to explain where the computational resources and data for training these "AI models" will come from, and the training process itself appears to be completely outside of the loop of SingularityNET. The Neuromation platform is designed to solve these problems: we bring together synthetic data generation as a possible solution for the data problem and tap into the vast computational resources currently being used for mining cryptocurrencies for the training problem.
On the other hand, I strongly agree with Ben Goertzel's sentiment when he says that "I don't think that... a few companies essentially owning AI... is best for humanity". We at Neuromation also strongly believe in democratizing AI research and empowering small businesses and individual AI researchers to stay on the cutting edge and not be left out of the loop by the sheer amounts of data and computational resources available to the big players. I will talk more about that later.
What are the milestones that Neuromation expects over the next 12-36 months?
By Q1 2018, we plan to open up our platform with the explicit goal of democratizing AI. We expect that in Q1 2018 we will have a working platform where AI practitioners will be able to order and utilize computational resources from the miners. On the provider side, however, we expect that it will still be limited to a few select partners with large mining pools.
By Q4 2018, we plan to release a fully distributed computational marketplace for synthetic data generation, model training, and model deployment, opening it for the worldwide mining pool community.
In 2019, we are going to develop an auction platform on blockchain so that autonomous agents are able to place orders and fulfill contracts for synthetic data generation and AI model training without human intervention. We believe that the Neuromation platform will provide an important stepping stone for the emergent machine-to-machine economy, where independent AI agents will be able to trade for data, computational power, and other resources.
By 2020, we plan to allow human agents to appear in the platform to help with simulation and data generation. For example, we envision sandbox environments where a human can contribute some kind of behaviour that will still be hard to generate procedurally (e.g., self-driving cars with reckless drivers). Perhaps, when sufficiently many self-driving car agents feel the need to explore a new subset of the search space, they will place a collective order that could be fulfilled by humans (modeling human drivers in the simulation).
What is the growth rate of labeled data versus unlabeled data?
If I understood the question correctly, the answer is as follows: unlabeled data is vast, basically unlimited; it surrounds us. All you see around you is unlabeled data. You can take millions of pictures, thousands of hours of video and so on at very little cost. Unlabeled data is only limited for applications with very specific kinds of data - e.g., in biomedical applications even unlabeled cardiograms, sequenced genomes, or imaging mass-spectrometry datasets can be very expensive. For more general tasks such as computer vision or speech recognition there is more unlabeled data than we could ever hope to process.
On the other hand, labeled data is usually very limited and very expensive. For example, if you need to train a neural network for object recognition, your dataset must contain images with the objects you plan to recognize labeled by bounding boxes on a lot of images in different environments. It is a lengthy and expensive process to produce millions of such labeled images by hand for a large-scale object recognition task.
The lack of labeled data is exactly the bottleneck that we at Neuromation are currently trying to solve with synthetic data.
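As a purely illustrative sketch of why synthetic data sidesteps the labeling bottleneck (this is not Neuromation's actual pipeline; the function name, image size, and rectangle "objects" are all assumptions made for the example): when the scene is generated procedurally, the bounding boxes are known by construction and no human annotation is needed.

```python
import random


def make_synthetic_sample(width=64, height=48, n_objects=3, seed=0):
    """Draw random filled rectangles on a blank grid.

    Returns (image, boxes). The image is a list of rows of integer class
    labels (0 = background). Each box records exactly where an object was
    drawn, so the annotation comes for free with the generated image.
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    boxes = []
    for label in range(1, n_objects + 1):
        w = rng.randint(4, 16)
        h = rng.randint(4, 16)
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        # Paint the object; later objects may overlap earlier ones.
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                image[y][x] = label
        boxes.append({"label": label, "x": x0, "y": y0, "w": w, "h": h})
    return image, boxes


if __name__ == "__main__":
    image, boxes = make_synthetic_sample(seed=42)
    for box in boxes:
        print(box)
```

A real generator would render 3D scenes rather than rectangles, but the principle is the same: the generator already knows the ground truth it is drawing.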
Does Neuromation see a roadmap to fully automated training of AI in the wild? The recent upgrade to AlphaGo starts from random play and then plays against itself, with better results than the version that relied on human inputs and reviewed human games.
We see the first steps in this direction being taken right now. AlphaGo Zero is not one of them, by the way: it is a very specific network for a very specific problem, and the takeaway point from AlphaGo Zero is that you can learn to play Go very well even without labeled data. This point may generalize to chess or DotA 2, but it does not generalize to, say, object recognition. In Go or chess, you can play a game and know the result, and this result is your objective function; in object recognition, there is no objective function at all if you don't know what you are trying to recognize.
I see the real first steps in fully automated AI training in two main directions. First, the autoML initiative where people train models to automatically adapt the architecture of other models for specific problems. If this approach fully succeeds, you will be able to basically have a big red "Train" button that will automatically choose and adapt a model architecture to your specific dataset.
Second, to get to automated AI we would need to make good progress on domain adaptation and transfer learning, reusing the same neural networks for very different tasks like we probably do in our brains. One interesting recent result here is PathNet by DeepMind, a large modular deep learning architecture able to automatically mix and match individual neural networks to fit a specific task. In this way, it is able to solve very different tasks (e.g., supervised learning for image recognition and reinforcement learning for games) with the same basic architecture.
I feel that these two directions may bring exciting advances and may be the first steps on the road to fully automated AI training in the wild.
Are there tipping points within various domains of knowledge where there is a critical core of knowledge? For example, 1 million medical records combined with 1 million full genome sequences and 1 million metabiomes.
I am very wary of throwing large numbers around; for me it is usually a huge red flag when people do that. We can clearly see the general trend: more data often brings new insights and enables better predictions that would have been impossible on smaller datasets. But we cannot know in advance exactly how much data we will need for a new breakthrough, before the breakthrough actually happens. I would be very happy to read the reasoning behind some of the numbers people put forward, but I suspect that often there is no reasoning at all, or the numbers are projections of how much data we will have, not how much will be needed for new critical breakthroughs...
The first AI winter happened in the 1960s in part because the governmental support for AI projects in the U.S. was significantly reduced. And that happened because when Frank Rosenblatt designed and implemented the first perceptron (a very simple model that learns a single linear decision boundary), everybody, including the New York Times, was talking as if strong AI was just around the corner. People actually tried to make a fully automated machine translation model in the late 1950s - a project that we now see was doomed from the start. Back in the 1950s, it was not obvious that it was doomed... but they shouldn't have been making huge promises either.
The second AI winter started in the late 1980s. By that time, people had learned to make generic neural networks and train them by backpropagation, but there was not enough data, not enough computational power, and some mathematical ideas needed to train deep neural networks were also missing. But people in the 1980s were again talking as if strong AI was very close, again with no real justification for this kind of prediction, and a big disillusionment was inevitable.
So hype is generally a good thing... until it's not. Let us be careful in our predictions so that we do not reach the tipping point for the third AI winter.
What are some other topics within Blockchain, cryptocurrency and AI that you feel are relevant to your company and your industry in general?
Obviously, any kind of deep learning or generally AI advances are potentially important for the Neuromation platform. I feel that the most important ones, as I have already briefly mentioned, will be advances to more automated machine learning.
Suppose that our vision comes to life, and in a few years we have hundreds of thousands of machines that were previously used for mining cryptocurrencies or simply underused, and they are all generating tens of thousands of synthetic datasets and training tens of thousands of neural networks on them. One very relevant problem, for example, will be to develop agents that will be able to automatically choose architectures for different tasks, similar to the autoML research direction I mentioned before.
First, such agents will make it easier to train your own AI models, which lines up with our vision of democratizing AI and giving everyone access to cheaper computational resources. Second, I expect that automated ML will require even more computational resources (to choose the best architecture, you probably need to train at least several of them, at least up to a certain point), with a tradeoff between how self-contained it is and how much computation and/or data it requires. We can help shift this tradeoff towards more fully automated AI models by providing cheap and reliable ways to generate training data and get computational resources for training.
The main blockchain-related challenge is to develop a large distributed blockchain-based auction that will operate very fast and will connect all our AI needs: synthetic data generation, model training, and model deployment. We plan to develop a machine-to-machine auction that will be able to execute very quickly and without human intervention. This is still an open problem and an exciting research direction. We hope to be the pioneers in this direction.
Currently one interesting thing we see on the horizon in this regard is the idea of a hashgraph that basically allows for a decentralized consensus protocol that does not require O(n^2) communication to reach consensus. The hashgraph could be adapted to create an immutable voting-based transaction system that does not have a central server but at the same time can be scaled up to millions of transactions. Perhaps this is a direction that can bring us to our machine-to-machine auction goal.
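To make the auction idea above concrete, here is a minimal, centralized sketch of one possible mechanism: a sealed-bid reverse auction with a second-price payment rule. This is an assumption chosen for illustration, not the design of the Neuromation platform, and it ignores the distributed consensus part entirely; the names `Bid`, `run_reverse_auction`, and `max_price` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    provider: str   # hypothetical identifier of a compute provider
    price: float    # asking price for the job, in hypothetical tokens


def run_reverse_auction(bids, max_price):
    """Pick a provider for one job without human intervention.

    The buyer states the most it is willing to pay (max_price). The lowest
    eligible ask wins and is paid the second-lowest eligible ask, or
    max_price if it was the only eligible bid. Ties on price are broken by
    provider name so the outcome is deterministic.
    Returns (provider, payment), or None if no bid is eligible.
    """
    eligible = sorted(
        (b for b in bids if b.price <= max_price),
        key=lambda b: (b.price, b.provider),
    )
    if not eligible:
        return None
    winner = eligible[0]
    payment = eligible[1].price if len(eligible) > 1 else max_price
    return winner.provider, payment


if __name__ == "__main__":
    bids = [Bid("pool-a", 5.0), Bid("pool-b", 3.0), Bid("pool-c", 9.0)]
    print(run_reverse_auction(bids, max_price=8.0))
```

The second-price rule is used here only because it gives providers an incentive to state their true cost; a production system would also need to handle job verification, disputes, and agreement on the bid set among untrusted nodes.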
What are some other topics within Blockchain, cryptocurrency and AI that you feel are important for society?
Blockchain itself is, of course, a very important thing for society. I do not need to tell you how blockchain-based currencies can revolutionize the economy at large, changing the lives of billions of people.
But to stay closer to Neuromation ideas, an important related topic is the rise of the machine-to-machine economy. And by this I mean not only the obvious applications like a smart refrigerator ordering food from an automated grocery store once you run out. We envision that a knowledge-based economy may arise between AI agents that wish to obtain training data for themselves or provide it to others.
Any drone, robot, or self-driving car, basically any kind of reinforcement learning model, may "wish" to purchase certain data/knowledge for further training. Moreover, sometimes it may obtain new knowledge that may be valuable for other agents, so that other agents will "agree" to pay for it. E.g., suppose that a self-driving car has been in a rare situation, say it has narrowly avoided a deer suddenly running out of the forest. This can be very valuable information for other self-driving cars since it is hard to simulate deer dashing out onto the road in training datasets.
Moreover, advanced AI agents can actually understand that this kind of data can be useful since it covers unexplored regions of their search spaces. Thus, they may "negotiate" to trade data between themselves, or spend some kind of money (probably also in the form of a cryptocurrency or token economy, like the Neurotoken economy we will have on our platform) to obtain either this rare training data or maybe even directly weight updates that are supposed to result from training on it. Or, to tie this in with the PathNet project I mentioned above, purchase a "small" neural network trained specifically on these rare datasets and add it as a component to the agent's arsenal.
This is exactly the kind of vision that we plan to empower with the Neuromation platform.