I decided to give myself the challenge of seeing if I could teach a machine learning model to detect the difference between Onion articles (and other satire) and real news articles in less than an hour. These were my results:
The first step to solving this problem was gathering the training data, which is by far the most difficult part. There are a number of ways you might accomplish this. One way is to scrape websites. If you're looking for the right tool, you could certainly try Scrapy (https://scrapy.org/) or the open-source tools at https://scrapinghub.com/open-source, or simply build your own (if you're so inclined).
I'm not really good enough at computers to figure out how to scrape websites, so I decided to gather the data manually (cringe). This process took up most of the hour. I visited www.theonion.com and started copying and pasting the content of articles into individual text files, which I simply named sequentially ("0.txt", "1.txt", etc.). I saved these text files into a folder called "satire". (After the first 10, I started to get really fast with the Command-C, open a new TextEdit window, Command-V sequence.)
I certainly don't recommend doing this manually. Be a better programmer than me and do proper web scraping.
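For anyone who does want to go the scraping route, here is a minimal sketch of the "build your own" option using only Python's standard library. It covers just the text-extraction half: it assumes you already have a page's HTML as a string and that the article body lives in <p> tags, which is an assumption about the markup and will not hold for every site. The names ParagraphExtractor and article_text are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_p = False    # True while we are inside a <p> element
        self._buf = []        # text fragments of the current paragraph
        self.paragraphs = []  # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        # Text outside <p> (headings, scripts, menus) is ignored.
        if self._in_p:
            self._buf.append(data)


def article_text(html):
    """Return the paragraph text of an HTML page, one blank line between paragraphs."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Writing the result of article_text(...) to "0.txt", "1.txt", and so on would reproduce the folder of text files I built by hand.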
Once I had about 300 example Onion articles, I switched to Reuters and AP news as a source of non-satire news. I started to manually copy and paste these articles into text files and save them to a new folder I called "notsatire".
Again, it was slow going at first, but once I got the hang of it, things sped up. At this point, I had been at it for about 45 minutes and was getting pretty tired. I decided that 300 of each class was enough to see whether I could train a classifier to detect the difference between the two.
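Since the whole dataset is just two folders of text files, a quick sanity check on the counts is easy to script. This is a small sketch, assuming the "satire" and "notsatire" folders sit side by side under one parent directory of your choosing; count_examples is a name invented for this example.

```python
from pathlib import Path


def count_examples(root):
    """Return {class_name: number of .txt files} for each subfolder of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            # Each subfolder name is treated as a class label.
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.txt"))
    return counts
```

Calling count_examples on the parent directory should report roughly 300 files for each of the two classes if the copy-and-paste marathon went as described.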
Naturally, my plan was to use Classificationbox as it makes it extremely easy to build and deploy classifiers. And, to make things EVEN easier, I quickly cloned Mat Ryer's handy textclass tool from GitHub so that I could run this puppy directly off of the files on my computer (instead of trying to build API queries with the text in the body).
First, I booted up Classificationbox.
Then, I fired off this command:
go run main.go -teachratio 0.8 -src ./testdata/fakenews/
and sat back to watch the results come in.
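To make the -teachratio 0.8 idea concrete, here is a rough sketch of an 80/20 split. I'm assuming the flag means "teach with 80% of the examples and hold back the rest for validation"; how textclass actually shuffles or splits the files is not something this sketch claims to reproduce, and split_examples is a made-up name.

```python
import random


def split_examples(paths, teach_ratio=0.8, seed=0):
    """Shuffle paths and split them into (teach, validate) lists.

    Assumption: teach_ratio is the fraction used for teaching and the
    remainder is held out for validation.
    """
    shuffled = list(paths)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * teach_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 600 files and a ratio of 0.8, this gives 480 files for teaching and 120 for validation.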
In about 1 second, Classificationbox was trained! Then it took another 2 or 3 seconds to settle and then validate.
The results show an accuracy of 83%. This tells me that it is very possible to build a classifier that can detect the difference between satire and real news. I probably just need to add more examples to increase the accuracy.
Just to be absolutely sure, I took 5 minutes, added 20 more articles to each class, and reran the test. This time I got a score of 86%.
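For reference, an accuracy score like this is just the fraction of held-out articles that the classifier labels correctly. A minimal sketch follows; the function name and the labels in the usage note are hypothetical, not output from my run.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For example, accuracy(["satire", "satire", "notsatire"], ["satire", "notsatire", "notsatire"]) gets two of three right. If 20% of 600 articles were held out (an assumption about how the split works), the validation set would be 120 articles, so each misclassified article moves the score by a little under one percentage point.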
What this demonstrates is that you can get started with building a machine-learning-powered classifier without a whole lot of training data. Even something as complicated as telling satire from real news can be accomplished in under an hour!
I strongly recommend you give it a try for yourself.
What is Machine Box?
Machine Box puts state of the art machine learning capabilities into Docker containers so developers like you can easily incorporate natural language processing, facial detection, object recognition, etc. into your own apps very quickly.
The boxes are built for scale, so when your app really takes off just add more boxes horizontally, to infinity and beyond. Oh, and the boxes are way cheaper than any of the cloud services (and they might be better)... and your data doesn't leave your infrastructure.