This post is the first of a two-part series in which we apply NLP techniques to analyze articles about big data, data science, and AI.
If you are tired of the hassles of web scraping, then this post might be just for you. I occasionally scrape news articles for NLP and data science projects, such as my fake news classifier article. Even though analyzing trends in the news is one of my favorite applications of NLP, it irks me that I have to spend considerable time and effort crafting scripts to sift through piles of HTML. So when I came across the Python 3 library Newspaper, I was overcome with joy.
In this post, I’ll demonstrate how to use Newspaper to download valuable information from multiple articles and how to put that data into a data frame.
First up, installing the library is simple. Here’s the pip command to do that.
pip3 install newspaper3k
The 3k suffix matters: newspaper3k is the Python 3 version of the library, while the plain newspaper package is the Python 2 version.
The following code demonstrates how to use the library to download the information of a single article.
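Here is a minimal sketch of that single-article workflow, built on Newspaper's Article class and its download() and parse() methods. The helper names fetch_article and article_to_record are hypothetical, chosen for illustration, and the fields collected are just the ones this post cares about.

```python
def article_to_record(article):
    """Collect the fields of a parsed article into a plain dict."""
    return {
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
    }


def fetch_article(url):
    """Download and parse one article, returning its fields as a dict.

    fetch_article is a hypothetical helper name, not part of Newspaper.
    """
    # Imported here so article_to_record can be reused without the library.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, authors, date and body text
    return article_to_record(article)
```

Calling fetch_article("https://example.com/some-news-story") (a placeholder URL) would return a dict whose "title" and "text" entries hold the cleaned title and body of the article.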
As you can see from the code, this process was incredibly simple. The best part is that the information we want from the article comes out quite clean: we didn’t have to write any regex to extract the article title or text.
Now let’s use the library on 50 links and put the downloaded information into a pandas data frame. This time, we’re also going to profile our code (developer jargon for measuring how long a script takes to run).
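A sketch of how the multi-link run could be put together, assuming the links are already in a Python list. The function names download_articles, fetch_with_newspaper and to_data_frame are hypothetical, and timing with time.perf_counter is just one simple way to profile the run.

```python
import time


def download_articles(urls, fetch):
    """Fetch every URL and time the whole run.

    `fetch` is any callable that takes a URL and returns a dict of fields.
    Returns (records, elapsed_seconds).
    """
    start = time.perf_counter()
    records = []
    for url in urls:
        record = dict(fetch(url))
        record["url"] = url  # keep the source link next to the parsed fields
        records.append(record)
    elapsed = time.perf_counter() - start
    return records, elapsed


def fetch_with_newspaper(url):
    """Download and parse one article with Newspaper."""
    from newspaper import Article  # third-party: pip3 install newspaper3k

    article = Article(url)
    article.download()
    article.parse()
    return {
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
    }


def to_data_frame(records):
    """Turn the list of per-article dicts into a pandas data frame."""
    import pandas as pd  # third-party

    return pd.DataFrame(records)
```

With the links in a list called urls, records, elapsed = download_articles(urls, fetch_with_newspaper) followed by df = to_data_frame(records) gives one row per article, and elapsed holds the total run time in seconds. The sketch has no error handling, so a single bad link would stop the run.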
I downloaded and parsed 50 links to New York Times articles and turned the resulting information into a pandas data frame. The whole process took about 39 seconds, which is less than one second per link (roughly 0.78 seconds). At that rate, building a corpus of 1,000 documents should take about 13 minutes.
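The extrapolation is simple arithmetic, sketched here with a hypothetical helper named estimate_minutes. It assumes run time scales linearly with the number of links, which ignores network variation and any rate limiting by the site.

```python
def estimate_minutes(elapsed_seconds, sample_count, target_count):
    """Linearly scale a measured run time to a larger number of links."""
    seconds = elapsed_seconds * target_count / sample_count
    return seconds / 60


per_link = 39 / 50                        # about 0.78 seconds per link
minutes = estimate_minutes(39, 50, 1000)  # 780 seconds, i.e. 13 minutes
print(f"{per_link:.2f} s per link, about {minutes:.0f} minutes for 1000 links")
```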
In the next article in this series, I’ll be analyzing those 50 links from above along with hundreds of other articles on data science, AI, big data, and more.
George McIntire, ODSC
I'm a journalist turned data scientist/journalist hybrid. Looking for opportunities in data science and/or journalism. Impossibly curious and passionate about learning new things. Before completing the Metis Data Science Bootcamp, I worked as a freelance journalist in San Francisco for Vice, Salon, SF Weekly, San Francisco Magazine, and more. I've referred to myself as a 'Swiss-Army knife' journalist and have written about a variety of topics ranging from tech to music to politics. Before getting into journalism, I graduated from Occidental College with a Bachelor of Arts in Economics. I chose to do the Metis Data Science Bootcamp to pursue my goal of using data science in journalism, which inspired me to focus my final project on being able to better understand the problem of police-related violence in America. Here is the repo with my code and presentation for my final project: https://github.com/GeorgeMcIntire/metis_final_project.