In a previous post, we demonstrated how to use the Python 3 library Newspaper to painlessly extract data from news articles. Using Newspaper, I was able to extract text from over 1,000 articles about topics including, but not limited to, data science, artificial intelligence, and big data. In this follow-up post, we’ll use unsupervised machine learning tools to investigate this corpus of articles. Specifically, we’ll use clustering and topic modeling to identify the various kinds of data science articles that are published.
First up, let’s look at a wordcloud of this corpus to get a feel for its content.
We see a lot of familiar terms that we’d expect from this collection of documents. This wordcloud is built from the so-called “keywords” that the Newspaper library derives from each article. The library has a function that extracts the main topics or keywords from each article, which is a great way to keep stopwords and other meaningless terms from polluting the word cloud.
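As a rough sketch of the counting step behind a keyword-based wordcloud: in Newspaper, an article's keywords become available as `article.keywords` after calling `download()`, `parse()`, and `nlp()` on an `Article`. The helper name `keyword_frequencies` and the sample keyword lists below are hypothetical stand-ins, not the data or code from the original analysis.

```python
from collections import Counter


def keyword_frequencies(keyword_lists):
    """Count how many articles mention each keyword (case-insensitive)."""
    counts = Counter()
    for keywords in keyword_lists:
        # Use a set so a keyword counts at most once per article.
        counts.update({keyword.lower() for keyword in keywords})
    return counts


# Hypothetical keyword lists, standing in for article.keywords values.
sample_keywords = [
    ["data", "science", "python"],
    ["machine", "learning", "data"],
    ["Data", "blockchain"],
]

frequencies = keyword_frequencies(sample_keywords)
print(frequencies.most_common(3))
```

A frequency mapping like this is the kind of input a wordcloud renderer can draw from, for example the `generate_from_frequencies` method of the `wordcloud` package, if that is the plotting tool in use.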
I decided to use Non-Negative Matrix Factorization (NMF) for topic modeling. I used a TF-IDF vectorizer to extract features from my corpus, which I then passed into an NMF object to derive the topics.
These seven topics show the nature of the content in my corpus. Topics 1, 2, 3, and 6 are the ones most closely related to data science. The other topics (4, 5, and 7) cover business and other non-data-science subjects such as blockchain.
After determining the topics, I wanted to cluster the articles and visualize those clusters. The following image displays five color-coded clusters (AI, business, analytics, tutorials, machine learning) on a two-dimensional graph.
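The post does not say which clustering algorithm or dimensionality reduction produced the graph, so the following is only one plausible way to get there: k-means on TF-IDF features, with truncated SVD used to project the articles onto two dimensions. The function name `cluster_and_project`, the toy documents, and the parameter values are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_and_project(documents, n_clusters=5):
    """Cluster documents with k-means and project them to 2D for plotting."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)

    # Assign each document to one of n_clusters groups.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(tfidf)

    # Reduce the sparse TF-IDF matrix to two coordinates per document.
    svd = TruncatedSVD(n_components=2, random_state=0)
    coords = svd.fit_transform(tfidf)
    return labels, coords


# Toy corpus (hypothetical); the real analysis used five clusters.
toy_articles = [
    "neural networks power deep learning models for image recognition",
    "deep learning models need large training datasets and gpus",
    "machine learning models predict outcomes from training data",
    "blockchain startups raise venture funding from investors",
    "investors fund business startups building blockchain products",
    "business leaders invest in analytics products for growth",
]

labels, coords = cluster_and_project(toy_articles, n_clusters=2)
for label, (x, y) in zip(labels, coords):
    print(f"cluster={label} x={x:.3f} y={y:.3f}")
```

The two columns of `coords` can be passed to a scatter plot, with `labels` supplying the color of each point.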
I'm a journalist turned data scientist/journalist hybrid. Looking for opportunities in data science and/or journalism. Impossibly curious and passionate about learning new things. Before completing the Metis Data Science Bootcamp, I worked as a freelance journalist in San Francisco for Vice, Salon, SF Weekly, San Francisco Magazine, and more. I've referred to myself as a 'Swiss-Army knife' journalist and have written about a variety of topics ranging from tech to music to politics. Before getting into journalism, I graduated from Occidental College with a Bachelor of Arts in Economics. I chose to do the Metis Data Science Bootcamp to pursue my goal of using data science in journalism, which inspired me to focus my final project on being able to better understand the problem of police-related violence in America. Here is the repo with my code and presentation for my final project: https://github.com/GeorgeMcIntire/metis_final_project.