
This blogpost is about topic modeling using data from this blog, opendatascience.com. Combining the extracted topics with the most visited articles of the year, we will generate the most popular topics of 2017. Last year, we did something similar with popular articles streamed through Twitter, using Non-Negative Matrix Factorization to determine topics (article here, example visual below). The feature image is snapped from our introductory page tree map.



Find the code and the data to start your own project. We love feedback, so don’t forget to send us your comments and results.

0. Goal

This project is about identifying industry interest vis-à-vis the Open Data Science blog. To do so, we collected all the articles and, well, analyzed them. It both sounds fun and was. Because we presume no one person has read them all, there exists no anecdotal distillation; thanks to topic modeling, we aimed to generate unbiased industry insight for 2017, with our blog's 467 articles as our sample.

1. Topic Modeling is NOT Text Classification

Let us explain through examples. If you want to discover the topics present in the documents you study, you use topic modeling: an unsupervised technique that extracts topics from a corpus of documents.

On the other hand, if you have predefined tags and you want to classify new documents, you can train a model to learn about the tags and then apply it to the new documents. That is Text Classification, a supervised technique. The preprocessing of the documents is similar for both techniques.
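To make the contrast concrete, here is a minimal sketch of supervised text classification with scikit-learn. The tiny corpus and tags below are invented for illustration, not taken from the blog data.

```python
# Supervised text classification: the model learns from predefined tags,
# then assigns one of those tags to a new document.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training documents with hand-assigned tags (illustrative only)
train_docs = [
    "neural networks and deep learning",
    "gradient descent trains deep models",
    "scraping pages with requests and selenium",
    "parsing html from a web page",
]
train_tags = ["deep-learning", "deep-learning", "web-scraping", "web-scraping"]

# Vectorize the text and fit a Naive Bayes classifier in one pipeline
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_tags)

# A new, unseen document gets one of the known tags
print(clf.predict(["a deep neural model"])[0])  # → deep-learning
```

Topic modeling, by contrast, receives no tags at all: the topics emerge from the word co-occurrence patterns in the corpus itself.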

2. Python Libraries

There are many tools and libraries to choose from. We tried three Python libraries.

  • gensim for topic modeling.
  • nltk, the Natural Language Toolkit, with multiple functions and applications for NLP.
  • sklearn (scikit-learn), which can work with documents too.

3. Data Collection with Selenium

Collecting the data is always an important part of the process. There are a number of different methods and tools for collecting documents on the web. Sometimes a website uses JavaScript that won't show you all the content at once; for those instances we can rely on the Selenium library.

…or… you can try using the Python 3 library Newspaper as we did in Using The Newspaper Library to Scrape News Articles.

selenium automates web browser interaction from Python. To use it, download the driver for the browser you want to automate. In this case, I'm using the Google Chrome driver on a MacBook Pro; the driver lives in the driver folder and is called chromedriver.

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def saveLinksToFile(links):
  # Saves the url of the articles to a file to process them later
  with open('01_data/odsc_links.csv', 'w') as f:
    for link in links:
      f.write(link.get_attribute('href') + '\n')

# Make sure you download the driver you need and it is accessible to selenium
path = 'driver/chromedriver'
driver = webdriver.Chrome(executable_path=path)
driver.get('https://opendatascience.com/blog/')  # adjust the URL if needed

# This for loop hits the End key 150 times to scroll all the way down the blog,
# waiting 2 seconds each time to let the content load before hitting it again.
# It prints the iteration number to show progress.
body = driver.find_element_by_tag_name('body')
for i in range(150):
  print(i, end=' ')
  body.send_keys(Keys.END)
  time.sleep(2)

links = driver.find_elements_by_class_name('sing_up')
saveLinksToFile(links)

If you run the code, you should see an instance of the browser performing the programmed tasks.

Next step: download the content from the links we just collected.

import pandas as pd

file_path = '01_data/odsc_links.csv'
blogposts = pd.read_csv(file_path)
# We read the links and create columns to store the title, the text of the article,
# the tags, and the date of publication
blogposts['title'] = ''
blogposts['text'] = ''
blogposts['tags'] = ''
blogposts['date'] = ''

import time
from datetime import date
from urllib.request import urlopen
from bs4 import BeautifulSoup

# MAX_WAIT sets how long we keep retrying a link before jumping to the next one.
# The values here are illustrative; tune them to your connection.
MAX_WAIT = 9
INCREMENT = 3  # seconds added to the delay on each retry

def get_text(soup):
 # Returns the text part of a BeautifulSoup object
 for d in soup.find_all("div", class_='article content single-article lang-en'):
  return d.get_text()

def get_posting_date(soup):
 # Returns the date of publication of the article
 for d in soup.find_all("div", class_='entry-meta'):
  d1 = str(d).split('|')
  d2 = d1[1].split('/')
  return date(int(d2[2].split(' ')[0]), int(d2[0].split('>')[1]), int(d2[1]))

def get_content(link, delay=3):
 # Returns the tags, the title, the text, and the posting date of the article
 tags = []
 title = ''
 text = ''
 posting_date = ''
 if "?p=" not in link and delay <= MAX_WAIT:
  try:
   r = urlopen(link).read()
   soup = BeautifulSoup(r, 'html.parser')
   text = get_text(soup)
   title = soup.title.text.split('|')[0].strip()
   for tag in soup.find_all("p", class_='tags_st'):
    for a in tag.find_all("a"):
     tags.append(a.get_text())
   posting_date = get_posting_date(soup)
   return tags, title, text, posting_date
  except Exception:
   # Wait a bit longer, then retry the same link
   print("({})".format(delay + INCREMENT), end=' ')
   time.sleep(delay)
   return get_content(link, delay + INCREMENT)
 return None, None, None, None

# This for loop gets the data for each blogpost by calling get_content() on its url
for i in range(len(blogposts)):
 print(i, end=' ')
 tags, title, text, posting_date = get_content(blogposts.iloc[i]['link'], 0)
 blogposts.at[i, 'tags'] = tags
 blogposts.at[i, 'title'] = title
 blogposts.at[i, 'text'] = text
 blogposts.at[i, 'date'] = posting_date

# We can save the content to a JSON file for future use.
blogposts.to_json(path_or_buf = '01_data/data_posts_v2.json', orient='records', lines = True)

def readBlogposts(file_path):
 return pd.read_json(file_path, orient='records', lines = True)

file_path = '01_data/data_posts_v2.json'
blogposts = readBlogposts(file_path)

# We keep only the articles published in 2017.
blogposts_2017 = blogposts[blogposts.date.dt.year == 2017]

4. Topic Modeling

Using scikit-learn and some code from here on topic modeling, we can extract the topics from the documents with LDA (Latent Dirichlet Allocation).

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

def print_top_words(model, feature_names, no_top_words):
 for topic_idx, topic in enumerate(model.components_):
  print("Topic %d:" % topic_idx)
  print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

documents = [doc for doc in blogposts_2017['text'] if doc is not None]

no_features = 2000

tf_vectorizer = CountVectorizer(max_df = 0.95, min_df = 2, max_features = no_features, stop_words = 'english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn

no_topics = 20

# In older scikit-learn versions, the n_components parameter was called n_topics
lda = LatentDirichletAllocation(n_components = no_topics, max_iter = 5, learning_method = 'online', learning_offset = 50., random_state = 0).fit(tf)

no_top_words = 10
print_top_words(lda, tf_feature_names, no_top_words)

Using Latent Dirichlet Allocation, we extract 20 topics from 2,000 features. I chose the number of topics in this case, but here is a way to determine the right number of topics for yourself.
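One common approach (a sketch, not necessarily the linked method) is to fit models for several candidate topic counts and compare held-out perplexity, keeping the count with the lowest score. The toy corpus and candidate counts below are purely illustrative; note that newer scikit-learn versions name the parameter n_components rather than n_topics.

```python
# Choosing the number of topics by held-out perplexity (lower is better).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Invented toy corpus; with real data you would use the collected documents
docs = [
    "deep learning neural network training",
    "neural network layers and models",
    "docker container image deployment",
    "running jupyter inside a docker container",
] * 5  # repeated so there is enough to split

train_docs, test_docs = train_test_split(docs, test_size=0.25, random_state=0)

vectorizer = CountVectorizer(stop_words='english')
train_tf = vectorizer.fit_transform(train_docs)
test_tf = vectorizer.transform(test_docs)

# Fit one model per candidate topic count and score it on held-out data
scores = {}
for k in (2, 5, 10):
    lda_k = LatentDirichletAllocation(n_components=k, random_state=0).fit(train_tf)
    scores[k] = lda_k.perplexity(test_tf)

best_k = min(scores, key=scores.get)
print(best_k, scores)
```

With the real corpus you would scan a wider range of counts and also sanity-check the winning topics by eye: perplexity alone can favor topic counts that are hard to interpret.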

These are the 20 topics.

  • Topic 0: data learning model using time different way values machine deep
  • Topic 1: data learning machine model use like time new ai science
  • Topic 2: word learning et al arxiv words language 2016 model neural
  • Topic 3: learning model training network image function data deep images used
  • Topic 4: jobs job data figure science software trends counts 2017 learning
  • Topic 5: data use function using model dataset 10 import values let
  • Topic 6: amp quot pca data principal components com wp component eigenvectors
  • Topic 7: tree decision forest random loan data julia water charts algorithm
  • Topic 8: conversion neural travel rate art ai networks learning acquisition rates
  • Topic 9: tag countries world scraping text requests html data page self
  • Topic 10: data learning use models using new like used example different
  • Topic 11: women ai data saw work services used like numbers government
  • Topic 12: model data different learning new neural word example just image
  • Topic 13: effects learning entities nlp dask data network deep entity use
  • Topic 14: learning network data neural words rate model use example vectors
  • Topic 15: random reduce bag edge tree car collect task problem nn
  • Topic 16: date student year percent data title loan people countries jobs
  • Topic 17: docker container run image kafka root s3 command jupyter directory
  • Topic 18: model data set models machine test learning user engine values
  • Topic 19: tidy dbl data country year amp squared frame function statistic

5. Most read articles of 2017

These are the top 10 most read articles of 2017. If we extract the topics using the code above, we get the following topics for the most read articles.

  1. /blog/jupyter-zeppelin-beaker-the-rise-of-the-notebooks/
  2. /blog/riding-on-large-data-with-scikit-learn/
  3. /blog/implementing-a-neural-network-from-scratch-in-python-an-introduction/
  4. /blog/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai
  5. /blog/how-to-build-a-fake-news-classification-model/
  6. /blog/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
  7. /blog/what-is-the-blockchain-and-why-is-it-so-important/
  8. /blog/an-introduction-to-object-oriented-data-science-in-python/
  9. /blog/r-or-python-for-data-science/
  10. /blog/implementing-a-cnn-for-text-classification-in-tensorflow/

If we use the top 30 articles of 2017, and set the number of topics to 10, we get:

  • Topic 0: python used learning set community classification machine trained use make
  • Topic 1: network neural representation learning layer model size training input use
  • Topic 2: feature time dataset pooling network learning convolutional missing level performance
  • Topic 3: learning deep neural model network ai layer machine classification feature
  • Topic 4: learning models network systems labeled representation like generative based responses
  • Topic 5: learning time python like ai model machine real work used
  • Topic 6: probability rain ai bayesian answer interpretation algorithms problem bayes inference
  • Topic 7: network neural model layer input language use output training convolutional
  • Topic 8: learning feature network deep image layer figure neural pooling ai
  • Topic 9: python notebook class self code science scala object language notebooks


In the next blogpost, we will explore dynamic topic modeling. Exciting, eh? Stay tuned.


Diego Arenas, ODSC

I've worked in BI, DWH, and data mining, and hold an MSc in Data Science. I have experience with multiple BI and data science tools, always thinking about how to solve information needs and add value to organisations from the data available. Experience with Business Objects, Pentaho, Informatica PowerCenter, SSAS, SSIS, SSRS, MS SQL Server from 2000 to 2017 and other DBMSs, Tableau, Hadoop, Python, R, and SQL. Predictive modelling. My interests are in information systems, data modeling, predictive and descriptive analysis, machine learning, data visualization, and open data. Specialties: data modeling, data warehousing, data mining, performance management, business intelligence.