Generating style-specific text from a small corpus of 2.5k sentences using a pre-trained language model. Code in PyTorch
Let's do a quick Turing Test. Below, you'll see ten machine learning project ideas. Five of them are generated by a human and five of them are generated by a neural network. Your task is to tell them apart.
Ready? (Don't overthink. Just go with your gut).
- Stock trading learner
- Predicting what is the difference in frequency in data
- Machine learning for patent classification
- Classifying the convolutional characteristics of remote - sensing images
- Face detection with random forests
- Machine learning to predict career success
- Rotation-invariant sparse coding
- The application of deep learning to fairness
- Object recognition from 3D point clouds
- Model of architecture in California
Seen the list? Take note of the five ideas that you think a neural network generated, and the five that a human generated.
*Cue drumroll* for answers: all the odd-numbered ideas are titles of final projects done in Stanford's CS229 Machine Learning course, and all the even-numbered ideas were generated by a neural network trained on that dataset.
Yes, scroll up and take a look at the list again, compare your notes and then we'll dive into details on how these ideas were generated. How accurate were you? (Tell in the comments!)
The motivation for this project was to learn how to use a recurrent neural network (RNN) to generate quotes similar to those of my favorite philosophers and thinkers. I had seen many other people generating music and even molecules using RNNs. I was pumped to do the same, but for philosophy.
From the web, I had collected about 5000 quotes from thinkers like Camus, Nietzsche, Wittgenstein, Feynman, David Hume, Stephen Hawking, and James Carse.
What escaped my notice completely was that the projects I took as inspiration usually had datasets that ran into the millions, and all I had was 5000 sentences. Naively and blindly, I marched on and failed repeatedly in getting my artificial philosopher to work. Finally, after three failed experiments, I got it to work, and I followed that up by building a generator for machine learning ideas.
The unreasonable stubbornness of RNNs when the training corpus is small
Here's the first failed experiment I did with RNNs:
- I thought the corpus was not big enough to train a word-level language model, so I went for a character-level language model
- I trained a very simple one-layer LSTM/GRU-based recurrent network on the corpus but it didn't do well
- It output English-looking but ultimately gibberish sentences such as "I can to the something we and the can to the come to this tore to the cost range tore to the conservative"
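To make the word-level versus character-level choice concrete, here is a minimal sketch in plain Python of how one sentence turns into training pairs under each scheme. It is an illustration only, not the notebook code: the sample sentence and the helper name `next_token_pairs` are made up, and a real pipeline would feed these pairs to an LSTM or GRU.

```python
def next_token_pairs(tokens, context_size):
    """Slide a window over the tokens: each context predicts the next token."""
    return [
        (tokens[i:i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]

# Hypothetical sample sentence standing in for one quote from the corpus.
quote = "to live is to be happy"

word_tokens = quote.split()   # word-level: one token per word
char_tokens = list(quote)     # character-level: one token per character

word_pairs = next_token_pairs(word_tokens, context_size=3)
char_pairs = next_token_pairs(char_tokens, context_size=3)

print("word vocabulary:", sorted(set(word_tokens)))
print("char vocabulary size:", len(set(char_tokens)))
print("first word pair:", word_pairs[0])
print("first char pair:", char_pairs[0])
print("pairs per quote:", len(word_pairs), "(word) vs", len(char_pairs), "(char)")
```

The same sentence yields far more training pairs and a much smaller vocabulary at the character level, which is why it looks attractive for a small corpus. The catch is that the network then has to learn spelling and word boundaries on top of meaning.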
My second failed experiment:
- I thought perhaps one layer of recurrent units wasn't enough, so I experimented with two layers and also played around with a higher learning rate.
- Nothing worked. I was still getting gibberish such as: "I really to test protection found that in you are for the man of there some of the more to you"
My third failed experiment:
- Because of the small corpus, I thought perhaps a generative adversarial framework would work better
- But that didn't work either. I realized that GANs for LSTMs are hard: there are papers on the topic, but training is difficult and the quality of the output is not that good.
- My GAN after much training was even worse. It generated text that was absolute rubbish: "x11114111411141114111"
After so many failed attempts, I was VERY frustrated that my artificial philosopher would always remain a pipe dream.
My conclusion from these failed attempts was that the culprit was the small text corpus.
Perhaps 5000 quotes were not enough to generate similar quotes of good quality?
So, as my next experiment, I wanted to try pre-existing word embeddings such as word2vec rather than forcing the network to learn the embeddings from scratch. But before doing that, I decided to ask for advice on Reddit's machine learning subreddit. On the thread that I started, someone pointed me to a poster accepted at the 2018 NeurIPS conference titled "Transfer Learning for Style-Specific Text Generation". I followed ideas from that paper and they worked like a charm.
What is transfer learning?
Transfer learning is a simple but powerful idea. It means using an existing model that's trained on a very large dataset as a starting point and tweaking it to work well on your domain-specific dataset.
In the computer vision community, transfer learning has been used for a long time. The idea is to use a publicly available model such as VGG that was trained on ImageNet data (the full dataset has about 14 million images across roughly 20,000 categories) and use the activations of its last layer as the input to an additional task-specific layer. The additional layer is then trained specifically on your small, domain-specific dataset for prediction, classification or any other task.
The surprising beauty of using pre-trained models is that, for free, you get to use all the concepts that the pre-trained model has learned from millions of images over hundreds of hours of training.
A pre-trained model is compressed knowledge.
These pre-learned concepts and activations enable you to predict and classify on your small, domain-specific dataset. This magic happens because all "natural" datasets share similar characteristics. Most images share many characteristics: from primitive concepts of shapes, lines and edges to high-level concepts such as textures and the effects of light and shadow.
Without pre-trained models, you'd have to learn all these concepts from scratch and your dataset may not contain enough examples to do that. With a pre-trained model, you go straight to learning what's important and different in your domain/problem without bothering about the common things found across datasets.
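The feature-extraction recipe described above can be sketched in a few lines. This is a toy in plain Python so that it runs without any dependencies, not the notebook code: `FROZEN_WEIGHTS` is a made-up stand-in for a pre-trained network, `small_dataset` is invented, and only the small logistic-regression head gets trained. In PyTorch the same idea is usually expressed by setting `requires_grad = False` on the pre-trained parameters and optimizing only the new layer.

```python
import math

# Stand-in for a pre-trained network. In a real project these weights would be
# loaded from a published model; the numbers here are made up for illustration.
FROZEN_WEIGHTS = (
    (0.5, -1.0),
    (1.5, 0.25),
    (-0.75, 1.0),
)

def extract_features(x):
    """Frozen layers: compute tanh(W x). These weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def predict(head, x):
    """Task-specific layer: logistic regression on top of the frozen features."""
    feats = extract_features(x)
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(head, data):
    """Average binary cross-entropy of the head over a dataset."""
    total = 0.0
    for x, y in data:
        p = min(max(predict(head, x), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train_head(data, epochs=200, lr=0.5):
    """Fit only the new head with plain stochastic gradient descent."""
    head = {"weights": [0.0] * len(FROZEN_WEIGHTS), "bias": 0.0}
    for _ in range(epochs):
        for x, y in data:
            feats = extract_features(x)
            error = predict(head, x) - y  # gradient of the log loss w.r.t. z
            head["weights"] = [w - lr * error * f for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head

# Tiny made-up "domain-specific" dataset: the label is 1 when x[0] > x[1].
small_dataset = [
    ((1.0, 0.0), 1), ((0.0, 1.0), 0), ((2.0, 1.0), 1),
    ((0.5, 1.5), 0), ((1.5, 0.5), 1), ((-1.0, 0.0), 0),
]

untrained_head = {"weights": [0.0, 0.0, 0.0], "bias": 0.0}
trained_head = train_head(small_dataset)

print("loss before training the head:", round(mean_log_loss(untrained_head, small_dataset), 3))
print("loss after training the head: ", round(mean_log_loss(trained_head, small_dataset), 3))
```

Only four numbers (three weights and a bias) are learned from the six examples; everything else is reused as-is, which is the whole point of starting from a pre-trained model.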
Transfer learning in text and NLP using pre-trained language models
Recently, transfer learning has started cropping up in NLP and exciting possibilities have opened up. There's Google's massive BERT model and then there's ULMFiT. These language models are trained on publicly available text corpora (such as parliamentary records, Wikipedia, etc.) and implicitly encode knowledge of English. Hence, they enable machine learning tasks such as classification and prediction on text even when your dataset is very small.
And that's precisely what I wanted! My dataset of quotes was small (~5k) and my objective was to generate new quotes with a style similar to those of my favorite thinkers.
Code for text generation from a small dataset
For generating style-specific text from my small corpus, I followed the paper linked from the Reddit thread and that led me to FastAI's lesson on classifying IMDB reviews using a pre-trained model. The lesson was about classification but as I was going through the FastAI library's documentation, I discovered that even generating text is made trivial thanks to helper functions in the library.
What pre-trained model do we use?
It's a 3-layer AWD-LSTM model developed by Salesforce's research team that's trained on 100 million tokens from Wikipedia articles. I encourage you to read more details on this specific model, but a key benefit of using pre-trained models is that you can get away with not understanding the underlying details. Just like you'll most likely not care about how PyTorch and NumPy work under the hood, you can also afford to not care how AWD-LSTM works under the hood.
This level of abstraction provided by pre-trained models is truly revolutionary.
Now, anyone can assemble a state-of-the-art deep learning model in their respective domain without requiring months or years of effort. (But it pays to know the details when you're not getting results.)
Artificial Philosopher: new philosophical insights generated by my neural network
When I ran my model, I literally couldn't believe what came out of it. In excitement, I tweeted about it:
My network said: "In the world, there is no man who is not a slave" and this sounded so too-good-to-be-true that I first checked whether it was simply repeating a memorized quote from the dataset. When I didn't find it, I googled the exact phrase to see if this idea had been expressed before. Lo and behold, I didn't find it on Google either.
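A memorization check like the one described above can be as simple as an exact-match lookup after normalizing case, punctuation and spacing. The sketch below is an assumption about how such a check might look, not the notebook code: `normalize` and `is_memorized` are hypothetical helpers, and the one-item `corpus` is a stand-in for the full set of quotes.

```python
import re

def normalize(text):
    """Lowercase and keep only word-like tokens so trivial differences don't hide a match."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def is_memorized(generated, corpus):
    """True if the generated sentence matches some corpus sentence after normalization."""
    known = {normalize(quote) for quote in corpus}
    return normalize(generated) in known

# Stand-in for the training corpus (the real one has about 5000 quotes).
corpus = [
    "The limits of my language mean the limits of my world",
]

for candidate in [
    "the limits of my language mean the limits of my world .",
    "the limits of my language are always your limits .",
    "in the world , there is no man who is not a slave",
]:
    print(is_memorized(candidate, corpus), "|", candidate)
```

The model's output is tokenized with spaces around punctuation (for example "do n't"), so a stricter check would also need to undo that spacing before comparing.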
Here is a run with 100 quotes generated by the neural network. These are not modified by me in any way. I've literally copy-pasted these from my notebook. (I'm bolding the ones which are intriguing and possibly unique).
'por las vida de los mundo de los mundo en el chi',
'the truth is that we can not be sure of what we do not know .',
'according to the standard , man is the real animal .',
"it is my business that i think that 's what i do n't know .",
'after a adventure , it was not until that point that old ideas were drawn up .',
'the lives of a finite player player must be avoided .',
'a human being is also an animal .',
'i had to turn',
"i want people to be happy but i do n't want to be ourselves .",
'there is a greater freedom for the philosophers and for the philosophers .',
'for a moment is not merely a thought , but a tragedy .',
'at this stage it is the true art .',
'i am the bridge , which is the bridge and the one that carries the bridge .',
'it is the wisdom that the world has not yet seen established at all .',
'the future is great because we love everything .',
'what is the belief in the right to man ?',
'nature is a perfect field of vision .',
'to learn to draw is to impose a law on the law of physics .',
'the feeling of absurdity : that is , as we see why we are here .',
'he who fights with monsters knows what he is .',
'no longer culture is a sign of self - mastery .',
'when the universe is rotating , i will make a very very big decision .',
'today this is probably the most so - called mystery of science .',
'it is not a matter of fact , a reason for being ashamed of love .',
'the world around me , i believe , it is the world itself .',
'the subject must always be a man who is not alone . the',
"some people n't want to be stupid .",
'the good dream is that you have to struggle for the good and the bad .',
'there is no community without the strong and just no one to live across .',
'i am not the man who is myself .',
'i felt that i had to quite cease to exist when i was touched .',
'the above principle of infinite players is perfectly good .',
'the stars are small and solitary , though they are neither great nor bad .',
'all souls are ignorance .',
'the limits of my language are always your limits .',
'the world is a world without any real purpose .',
'beyond the curve of the days become his favorite .',
'i continue to believe in this condition of life',
'here is the origin of all things .',
'we have to live and let live in order that we can create a universe .',
'a man is very much the most fertile man .',
'this world comes out of nowhere .',
'to live is to be happy .',
'the present man has a reason to be able to give birth to a dancing star .',
"it 's surprising to say that the real world had a beginning :",
'the first thing is to die at heart .',
'and how i learned to walk and dance , i must sing .',
'as long as the mind is limited by its laws , it can not be understood .',
'the weakness of the apes is not infinite , but our surprise .',
'at the end of a conversation there is an invincible summer .',
'les de la vida pour la vie a es se non het la vida .',
'i say the last thing , in the end , is to die .',
'what does man understand is that the true man is a unjust child .',
'the soul is a dead cause .',
"it seems that there is nothing less than a child 's love .",
'that is why the world is governed by the laws of physics .',
'the king is a genius who goes to school to be a public relations yes .',
'the child is born of this solitude .',
'i am a tree among trees .',
'we have never got the least of ideas and ideas .',
'every age in the middle ages is born of the holy spirit of peace .',
'but no one is willing to strive for peace , justice or reason .',
"but n't the time is going to happen if what breathe is bad .",
'at the heart of all beauty lies something monstrous and full of things .',
'really , my heart is never .',
'yes , it is to say that there is a very rapid increase in human affairs .',
'everything in the world is like a dead world .',
'the good man is a man who wants to play my own .',
'there are no real truths , but it is a perpetual truth that is true .',
'you imagine that he can not live without knowing how to live .',
'the problems of life are not universal , but the future of mankind .',
'no one can build a bridge across the river of life .',
'passion is the desire to be strong . however it is necessary to love the weak .',
'in the end one must have experience to live with envy .',
'from the dark horizon of my future a future will never be seen .',
'he who does not know has the meaning , his reason to do .',
'no one has any idea how youth ... must have learned how to walk .',
'it is true that mersault was a very poor soil .',
'this is where we see and where there are , and what we see here .',
'a species of th',
'there are no boundaries between those limits of physical limits .',
'man is one who has the advantage of being overcome',
'woman is a woman . she is a tahu .',
'to live is to live alone',
'the fate of a man must be as great as the rank of man .',
'all artists of the twentieth century are quite free to live alone !',
'there is no justification of the state for the astonishment of the world .',
'there is evidence that in every human being , a human being must win the match .',
'the world is worth living .',
'the dumb man is not a bad bad conscience but a bad liar',
'because we have to choose between being understood we have a friend and love .',
'the mother of child dignity is a mother or mother .',
'it is the art of value that we do not understand .',
'a writer has been written with a definite idea of what is really in his universe',
'they believe that something is rare for the rare .',
'every step forward in the world preclude a future and there is a certain future .',
'and continuing that is the horror of original conservation .',
'solitude is often an activity .',
'one concerns me that things can never be forgotten .',
'i love people who have no happiness , but not lest they become strong .'
You must have seen equally impressive generated text in other articles. But I think what's impressive here is the quality of generated text given that my training set was extremely small (5k sentences). This is only possible using a pre-trained model.
It may not seem like much but I think an idea like "the limits of my language are always your limits" seems like something the language philosopher Ludwig Wittgenstein might have said. In fact, when you Google this phrase, you find no exact results but Google recommends checking the Wikipedia article on Wittgenstein.
In reality, Wittgenstein had said: "The limits of my language mean the limits of my world" and our model has smartly (and in a grammatically accurate fashion) changed it to something new.
Similarly, the generated quote "the present man has a reason to be able to give birth to a dancing star" is reminiscent of Nietzsche because he has mentioned "dancing star" in his books but he never said it in the context of the present man. I may be reading too much into it, but to me, the generated quote represents the idea that we've become so technologically advanced that we can give rise to really complicated machinery (like a dancing star) and we've become so competitive that we have a reason to do that. (Is my neural network warning us of the potential dangers of AI and the inevitability of it?)
Let's give rise to the dancing star: generating new machine learning ideas
Remember that my corpus for philosophy quotes was ~5000 sentences. I wondered how this approach would perform if I were to give it an even smaller corpus.
I decided that generating machine learning ideas would be fun. To my knowledge, nobody else so far has tried doing that. So I collected titles of all the machine learning projects that students in Stanford's CS229 class had submitted from 2004 to 2017. The dataset includes 2500 ideas of five to seven words each. The dataset and the corresponding notebook are available in my repository. (Note: I don't own the copyright to the ideas. They were collected merely for research and exploration.)
The project seemed exciting but my main worry was that the domain of machine learning project ideas is very narrow and contains niche and technical terms. I thought the model would mostly spit out memorized ideas, the same as the ones in the dataset.
However, to my surprise, it generated some very novel ideas (in bold, followed by my commentary):
- "a different genre of a social video game via digital camera behavior". There's no "social video game" phrase in the dataset, so this must be new.
- "a machine learning approach to learning to recognize events from academic topics". There's no "academic topics" phrase in the dataset.
- "predicting what is the difference in frequency in data" <- there's no "Frequency in data" phrase in the dataset.
- "Using learning to predict features for identifying gene expression" <- actually a novel idea!
- "classifying human gene expression in optical images" <- there's no project idea on classifying human gene expression.
- "making an image of the world". I think this is an interesting project suggestion where you have to come up with one image/graphic that represents the entire world.
- "predicting the dimensions of human behavior". Possibly a suggestion for unsupervised classification on all the different ways that humans behave?
- "reinforcement learning to improve professional learning". The training dataset doesn't have the phrase "professional learning". How do you improve learning ability in professional courses by using ideas from reinforcement learning? I was really impressed by this one as it seems both valuable and doable.
- "a single expression of what 's on the corporate market ?". How do you combine all indicators available for stock markets to come up with one indicator that's most informative?
- "types of cardiac processes". Unsupervised learning to cluster similar patterns of cardiac processes to help in predicting and analyzing the ones that can lead to a cardiac arrest.
- "in the natural history of human interaction". Using a human migration dataset, how do you classify historical human interactions? Can you generate new insights on human interactions that historians and anthropologists have missed?
- "classifying the convolutional characteristics of remote - sensing images". The dataset doesn't have the phrase "convolutional characteristics". This project sounds like a fun research project for anyone who's interested in the theory behind CNNs.
- "classifying and predicting the event reviews" <- wow, the dataset doesn't have the phrase "event reviews". Just like IMDB reviews, can we collect event reviews (of plays or rock concerts) and predict for future events which ones are going to be successful and which ones will be unsuccessful?
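Claims such as "there's no 'social video game' phrase in the dataset" can be checked mechanically. Here is a minimal sketch, assuming the titles are available as a list of strings; `phrase_in_corpus` is a hypothetical helper, and the five titles below are just the human-written ones from the quiz at the top of this post, standing in for the full 2500.

```python
def tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def ngrams(words, n):
    """All runs of n consecutive words, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def phrase_in_corpus(phrase, corpus):
    """True if the phrase appears as consecutive whole words in any corpus title."""
    words = tuple(tokens(phrase))
    return any(words in ngrams(tokens(title), len(words)) for title in corpus)

# Stand-in for the training titles (the real dataset has about 2500).
training_titles = [
    "Stock trading learner",
    "Machine learning for patent classification",
    "Face detection with random forests",
    "Rotation-invariant sparse coding",
    "Object recognition from 3D point clouds",
]

for phrase in ["social video game", "convolutional characteristics", "patent classification"]:
    print(phrase_in_corpus(phrase, training_titles), "|", phrase)
```

Matching on whole words rather than raw substrings avoids false hits where a short phrase happens to sit inside a longer word.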
If you want unfiltered output from the model, here are 100 ideas that it generated. I've not modified anything (just bolding the ones I think are interesting and novel).
'the problem is right: grasping and extracting the face of pose',
'applying machine learning to text treatment',
'machine learning techniques for learning through machine learning',
'a machine learning approach to predicting career success from a single object',
'using machine learning to predict the outcome of a machine learning approach',
'based on stock prices',
'identifying stock price models for machine learning techniques',
'a study in time travel time series analysis of musical features',
'vectors in the amazon impact network',
'classification of web articles in facebook',
'dynamic signal processing in post - secondary order data',
'copy selection with machine learning techniques',
'interpretation of user classification',
'the application of deep learning to fairness in using a semantic framework',
"creating a different entity 's portfolio",
'using supervised learning of blind data',
'system classification for driving automatic vehicle design and classification with gene expression',
'based on public documents from text expression',
'semantic learning for music',
'machine learning for cancer prediction',
'learning static variations with deep learning for learning options',
'image classification for svm',
'satellite imagery classification',
'making decision selection from a single object',
'object detection using product preferences',
'speech detection with deep learning',
'genomic data based on stock trading',
'learning to predict approach to handwriting',
'classification of musical features from the composer data',
'semantic social network and smartphone features',
'machine learning techniques',
'using real - time information to predict the popularity of the market',
'video game classification',
'a learning task for time series players',
'using a single machine learning approach for a single learning approach to learning to identify other environments',
'multiple - genre classification of fraud \n prediction for a mass neural network',
'learning of human activity recognition from analysis of text',
"an nba player 's approach to learning and character forecasting through video game ecg",
'playing a vocal instrument in local mri learning',
'real - time music recordings',
'finding new artistic and artistic features in music videos',
'an analysis of musical genres',
'predicting a single image - specific musical style',
'a cost approach to crime prediction',
'automatic user prediction and automated review recognition',
'food processing via machine learning',
'human activity recognition using multi - label fantasy',
'predicting a match in the keystroke poker',
'estimation of game types',
'ai identification of deep learning in locomotion monitoring using neural networks',
'the value of collaborative attention projecting for real - time playing',
'the sea level and low speed : the two waves',
'learning to predict the price of beer and personal genomes',
'trading and removing a novel image from the text',
'real - time news user identification on google gestures',
'removing and re - learning to play game and lyrics',
'rapid - mass dynamics with acoustic images',
'real - time music direction',
"what 's your right ?",
'exploring event and music',
'human activity prediction using machine learning',
'model of architecture in california',
'vs light crime',
'adaptive learning for image recognition',
'predicting the approach of human activity using machine learning',
'the win given trajectories',
'a machine learning approach to online design',
'a massive based multi - layer feature unsupervised approach for multi - agent music',
'can you learn from a single hand',
'reaction with the media',
'measurement of time to order over time',
'how people can stop : learning the objects of blood and blood',
'machine learning for autonomous vehicles',
'vehicle types in neural networks',
'building a model for what does it store ?',
'for enhanced identification of machine learning techniques',
"exploring new york city 's public image through machine learning",
'a novel approach to career image recognition',
'in general game playing',
'structure classification for adaptation of text',
'a variance learning approach for speech recognition',
'the optimization of a non - peer temporal layer',
"a distinguishing feature of a song 's legal expression",
'learning to sound in english : learning to learn using word learning',
'information sharing with adaptive neural networks',
'playing the game with multi - touch neural networks',
'recursive estimation of dynamic and static images',
'predicting the quality of the net - style result in the media',
'the character of the sea snake robot',
'predicting the stock market price of machine learning',
'using inverted nucleotide data to predict the price of convolutional protein models',
'using twitter data to predict prices in high - cost trading',
'a machine learning approach',
'creating a new approach to building a deep learning approach',
'fingerprint learning component',
'machine learning techniques for functional change learning for the building of new york city college football networks',
'predicting cancer risk of breast cancer risk',
'cancer diagnosis and prediction',
'stock market classification',
'identifying the outcome of the news media'
I haven't checked thoroughly, but random checks tell me that most of the generated ideas are unique. I think the reason the generated text isn't memorized from the training corpus is that we're using a pre-trained model. The pre-trained language model was trained on Wikipedia and hence it has strong opinions on how concepts and words are related even before seeing the training data.
For a model that's initialized randomly, the easiest way to reduce the training error is to memorize the training corpus. This results in over-fitting. However, for a pre-trained model, if the network tries to memorize the training corpus, it can only do that if it first forgets previously learned weights. And since that leads to a higher error, the easier way is to accommodate the training corpus within the context of the earlier learned weights. Hence, the network is forced to generalize and generates grammatically correct sentences (thanks to pre-training on Wikipedia) but using domain-specific concepts and words (thanks to your dataset).
What would you train using this approach?
Before pre-trained models were available, you needed a huge corpus of text to do anything meaningful. Now, even a small dataset is enough to do interesting things. Let me know in the comments what project ideas come to your mind that could use a small text corpus along with a pre-trained model.
Some ideas to get your neurons firing:
- Using your tweets, train a model that tweets like you
- Using data dump from your WhatsApp, make a bot that chats like you
- For your company, classify support tickets into BUG or FEATURE REQUEST
- Make a bot that generates quotes similar to your favorite author
- Make your own customized AUTO-REPLY drafter for Gmail
- Provided a photo and an Instagram account, generate a caption in the style of the account's previous captions
- Generate new blog post ideas for your blog (based on previous blog post titles)
Also, it'll be super cool if you end up implementing a machine learning project idea generated by my model (or one of those contained in this post). You'll be part of the world's first project that a machine thought of and a human implemented!