Author: Faruqui Ismail
By Faruqui Ismail and NookaRaju Garimella
Reporters with various forms of “fake news” from an 1894 illustration by Frederick Burr Opper
We’ve always pictured the rise of artificial intelligence as being the end of civilization, at least from watching movies like ‘The Terminator – Judgement Day’. We could not have imagined that something as insignificant as misinformation, would lead to the collapse of organisations; beginning wars and even mass suicides.
The definition of what we regard as “Fake” news has a broad spectrum. Consider an article published in 2001, which was true at the time. That same article being published now, excluding the date… giving it an appearance of recently occurring events. Would be regarded as “misinformation” or “Fake”.
In summary, we identified a need to identify the truth from misinformation and created a product that would help us do that. We began by creating 2 robots using BeautifulSoup (bs4) and Selenium, these robots extracted data from various fake news sites according to Wikipedia. We then supplemented this data with GitHub data (refer to acknowledgements).
Post cleaning and reworking the data using some Natural Language Processing(NLP) techniques, we proceeded to create features. By asking the question, what makes a fake news article different from a non-fake news article? We agreed on the following:
- The % of punctuation’s in an article (by ‘over-dramatizing’ events people will use more punctuation’s than usual)
- The % of capital letters in an article (once again, this takes care of e.g. “DID YOU KNOW”)
- If the article came from a website known for publicizing sensational/fake stories as tracked by Wikipedia
- Finally, we looked at poor sentence construction. Sentences constructed too long are usually indicative of someone who is not a journalist writing the article
To increase the overall accuracy of the final prediction. These features were then checked to see if they were not too correlated, and that the sub contents of some of these features did not overlap e.g.:
To avoid over-fitting of the model, feature transformation was done. This helped normalize the feature which helped prevent over-fitting. This visual is an example of the transformation done of % of upper case letters to the new article:
These small changes increased the final prediction precision by 9.63%.
Once these features were created, we dove into NLP. We removed all stop words; tokenized and stemmed the data; excluded all punctuation’s from the text etc.
Considering prediction times, preference was given to Porter stemming over Lemmatizing, NLP generally creates a massive quantity of features.
Again, balancing precision with the time it takes to run the program was a key consideration on which vectorizer to use. GridSearchCV to the rescue. We ran TFIDF Vectorizer as well as a Count vectorizer on certain parameters and recorded their fit times and prediction scores:
RandomForest was a strong candidate for our prediction, hence we used it. To identify the best possible parameters in the machine learning algorithm. A grid was constructed which provided the optimal n_est and depth which would yield the highest precision, accuracy and recall.
Est: 50 |
Depth: 10 |
Precision: 0.6921 |
Recall: 0.4833 |
Accuracy: 0.4769 |
Est: 50 |
Depth: 30 |
Precision: 0.8405 |
Recall: 0.8166 |
Accuracy: 0.7923 |
Est: 50 |
Depth: 90 |
Precision: 0.8479 |
Recall: 0.8416 |
Accuracy: 0.8461 |
Est: 50 |
Depth: None |
Precision: 0.8143 |
Recall: 0.7916 |
Accuracy: 0.8076 |
Est: 100 |
Depth: 10 |
Precision: 0.7159 |
Recall: 0.6416 |
Accuracy: 0.6153 |
Est: 100 |
Depth: 30 |
Precision: 0.8352 |
Recall: 0.8 |
Accuracy: 0.7923 |
Est: 100 |
Depth: 90 |
Precision: 0.8685 |
Recall: 0.8583 |
Accuracy: 0.8615 |
Est: 100 |
Depth: None |
Precision: 0.8936 |
Recall: 0.9166 |
Accuracy: 0.9076 |
Est: 150 |
Depth: 10 |
Precision: 0.7066 |
Recall: 0.6 |
Accuracy: 0.5615 |
Est: 150 |
Depth: 30 |
Precision: 0.8398 |
Recall: 0.8333 |
Accuracy: 0.8230 |
Est: 150 |
Depth: 90 |
Precision: 0.8613 |
Recall: 0.8583 |
Accuracy: 0.8461 |
Est: 150 |
Depth: None |
Precision: 0.8786 |
Recall: 0.8833 |
Accuracy: 0.8769 |
This entire project was then packaged into a web framework using Django. The view showed whether the data was e.g. unreliable, junk science, fake, true etc. It is in the process of being published on the web.
Authors and Creators:
Acknowledgements:
GitHub data: https://github.com/several27/FakeNewsCorpus
Wikipedia fake news list: https://en.wikipedia.org/wiki/List_of_fake_news_websites