I recently had the chance to use machine learning to address an issue at the forefront of the American media: the difficulty of recognizing fake news. Specifically, my classmate David Masse and I applied two ML approaches to identify deliberately misleading news articles: logistic regression and a naïve Bayes classifier. Using a Kaggle dataset of 20,000 labeled articles, we achieved an accuracy of 93% when predicting labels for a test set. It was a great opportunity to practice natural language processing, as well as some of the effective techniques for building a powerful classification model.

Natural Language Processing is the field of computer science devoted to processing and analyzing any form of natural human language (written, spoken or otherwise). Put simply, computers understand zeros and ones, while humans use a wide range of language to communicate. NLP aims to bridge the gap between these two worlds so that data scientists and machine learning engineers can analyze large quantities of human communication data.

In the context of the fake news problem, NLP allows us to break down articles into their components, and choose important features. We then construct and train models to identify unreliable documents.

Cleanup Time!

The first order of business for many data-driven projects, after exploration, is to clean up your data. We were working with thousands of articles from a wide range of sources, some much cleaner than others. Regular expressions provided a way of limiting the types of strings that we allowed to be included in our analysis. For example, this line of code using Python's re module:

replaces all characters that are not alphanumeric with spaces. The ^ denotes that we are replacing the complement of the specified set. Once we have removed undesirable characters, we are ready to tokenize and vectorize!
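A minimal sketch of a substitution matching that description is shown here. It assumes the allowed set is ASCII letters and digits; the helper name clean_text and the sample sentence are hypothetical and only for illustration.

```python
import re


def clean_text(text):
    # Replace every character that is NOT an ASCII letter or digit with a space.
    # Inside the brackets, ^ negates the set, so the pattern matches its complement.
    return re.sub(r"[^a-zA-Z0-9]", " ", text)


sample = "Breaking: 3 senators (allegedly) resign!"
cleaned = clean_text(sample)
print(cleaned.split())
```

Because each removed character becomes a single space, the string keeps its length and the words stay separated, which keeps tokenization simple afterwards.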

Scikit-learn is an incredible machine learning package for Python, and it does a lot of the heavy lifting. In particular, CountVectorizer builds a full vocabulary of all the texts being analyzed, and turns each individual document into a vector holding the count of each vocabulary term in that document. It returns the vectors as a sparse matrix, since most articles do not contain most words. The vectorizer also lets you plug in your own preprocessing functions, as well as your preferred tokenizer.

In the case of news analysis, it is overly simplistic to consider each word individually, and so our vectorization allowed for bigrams, or two-word phrases. We limited the number of features to 1000, so that we only consider the most important features for classifying a document as real or fake.
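A vectorizer configured along these lines might look like the sketch below. The three toy documents are made up for illustration, and the exact settings used in the project are an assumption. Note that scikit-learn's max_features keeps the terms with the highest frequency across the corpus.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the article bodies (hypothetical examples).
docs = [
    "the senate passed the bill",
    "the bill was passed by the senate",
    "aliens passed the bill",
]

# ngram_range=(1, 2) keeps single words and two-word phrases;
# max_features caps the vocabulary at the 1000 most frequent terms.
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1000)
X = vectorizer.fit_transform(docs)

print(X.shape)                              # (number of documents, vocabulary size)
print(sparse.issparse(X))                   # the counts come back as a sparse matrix
print("the bill" in vectorizer.vocabulary_)  # bigrams are part of the vocabulary
```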

Feature Engineering

Feature engineering is less of a simple skill and more of a complex art form. It encompasses the process of considering your dataset and domain, deciding which features will be most powerful for your model, and finally testing your features in order to optimize your choice. Scikit-learn's vectorizer extracted our 1000 base n-gram features, and we set out to add meta-features to refine our classification. We decided to calculate the average word length and the number of numerical values in each document, which improved the accuracy of our model.
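The two meta-features could be computed roughly as follows. This is a sketch under assumptions: the helper name meta_features is hypothetical, a "word" is taken to be a run of ASCII letters, and a "numerical value" is a run of digits with an optional decimal part. The project may have defined both differently.

```python
import re


def meta_features(text):
    # Words: runs of letters. Numbers: digit runs with an optional decimal part.
    words = re.findall(r"[A-Za-z]+", text)
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    # Guard against empty documents to avoid dividing by zero.
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return avg_word_length, len(numbers)


avg_len, n_numbers = meta_features("In 2016 the vote was 52 to 48")
print(avg_len, n_numbers)
```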

Sentiment analysis represents yet another analytical tool of NLP, assigning numerical values to the general sentiments expressed in a body of text. We made use of two packages, TextBlob and the Natural Language Toolkit's VADER. These are great tools for performing out-of-the-box sentiment analysis on text documents. VADER produces various scores measuring polarity and neutrality, while TextBlob offers overall subjectivity as well as its own measure of polarity. We had originally hoped that these sentiments would be distributed very differently when calculated for misleading and true articles; however, we did not find that to be the case.

As can be seen above, there is very little insight in the sentiment scores alone. However, we decided to keep them in our models as they increased accuracy when coupled with our classifiers.
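One way to couple extra per-document scores with the sparse n-gram counts is to stack them as additional columns. The sketch below uses made-up numbers: the count matrix and the two sentiment columns (polarity and subjectivity) are hypothetical placeholders, not values from our dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy n-gram counts: 3 documents x 4 vocabulary terms (hypothetical values).
ngram_counts = csr_matrix(np.array([[1, 0, 2, 0],
                                    [0, 1, 0, 0],
                                    [3, 0, 0, 1]]))

# Hypothetical sentiment columns per document: polarity, subjectivity.
sentiment = np.array([[0.1, 0.5],
                      [-0.4, 0.9],
                      [0.0, 0.2]])

# Append the dense columns to the sparse counts, keeping the result sparse.
combined = hstack([ngram_counts, csr_matrix(sentiment)]).tocsr()
print(combined.shape)
```

Polarity scores can be negative, and scikit-learn's multinomial naive Bayes rejects negative feature values, so such columns would likely need rescaling to a non-negative range before being passed to that model.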

Logistic Regression

The system we are considering here is binary, where the only classes are real and fake. We need to model the probability that an article is unreliable, given its associated features. This makes it a natural candidate for binary logistic regression, whereby our model relies on a logit transformation and maximum likelihood estimation to model the probability of unreliability in relation to our predictor variables.
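Written out, the model described here takes the following standard form. The symbols are assumptions introduced for illustration: x is the feature vector of an article, y = 1 marks an unreliable article, and \beta_0 and \beta are the intercept and coefficients fitted by maximum likelihood.

```latex
\[
p(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}},
\qquad
\log \frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)} = \beta_0 + \beta^{\top} x
\]
```

The second expression is the logit: the log-odds of unreliability are linear in the features.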

In other words, LR models the posterior p(y|x) directly, learning which labels to assign to which feature inputs. This is an example of a discriminative approach. Although logistic regression is, strictly speaking, a regression model rather than a classifier, it gives the conditional probability of class membership, which we use to assign a label. Scikit-learn's logistic regression model offers a straightforward way to perform this.
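A minimal sketch of that workflow with scikit-learn is shown below. It runs on synthetic data rather than our article features, and the split size, random seeds and max_iter value are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix: 200 "documents", 5 features.
# The label depends mostly on the first feature, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba returns p(y | x) for each class; predict thresholds it at 0.5.
proba = model.predict_proba(X_test)
accuracy = model.score(X_test, y_test)
print(proba[:3], accuracy)
```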

Naive Bayes Classifier

Our second approach was to use a naïve Bayes (NB) algorithm, which despite its simplicity worked well for this application. With the assumption of independence between features, NB learns a model of the joint probability p(x,y) from the labeled articles. Predictions are then made using Bayes' rule to compute the conditional probability, assigning the label with the highest probability. This, by contrast, is an example of a generative classifier.
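In symbols, the decision rule described here can be written as follows, where x_1, ..., x_n are the features of an article and y is its label (notation introduced for illustration):

```latex
\[
p(y \mid x_1, \dots, x_n) \propto p(y) \prod_{i=1}^{n} p(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; p(y) \prod_{i=1}^{n} p(x_i \mid y)
\]
```

The product over features is exactly where the independence assumption enters.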

In this case, we consider a multinomial event model, which most accurately represents the distribution of our features. The results closely match those achieved by the logistic fit, as expected.
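As a rough illustration of the multinomial event model, the sketch below fits scikit-learn's MultinomialNB on word counts. The four training sentences and their labels (1 for fake, 0 for real) are invented for the example and are not drawn from our dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 1 = fake, 0 = real.
train_docs = [
    "shocking secret cure doctors hate",
    "miracle secret shocking truth",
    "senate passes budget bill",
    "court upholds budget ruling",
]
train_labels = [1, 1, 0, 0]

# Word counts are the natural input for a multinomial event model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# New documents must be transformed with the same fitted vocabulary.
X_new = vectorizer.transform(["shocking miracle cure", "senate budget ruling"])
predictions = clf.predict(X_new)
print(predictions)
```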

In Conclusion

Misinformation in the form of fake news can be effectively identified using machine learning. Even without contextual information such as headline or source, the body text provides enough to do the trick. As a result, these strategies can be easily applied to other documents where additional descriptors are not available. While sentiment features were not sufficient for fake news classification on their own, they did improve our classifier performance when coupled with other features. Future work could compare these results with other popular models, such as support vector machines.