Data Science on Twitter
By: Jason O’Rawe – ODSC data science team contributor
Twitter is an indispensable resource for data scientists as well as for the broader data science community. With the right connections, you can use Twitter to learn data science, discover new technologies, computational tools and methodologies, and contribute to and build a community of data scientists working for the social good.
Indeed, we at the ODSC use Twitter to spread the word about the great data science speakers and workshops happening at conference events like ODSC East. With a good Twitter list, however, you can bring much of the value and content that comes with attending an influential meeting like ODSC East directly to your Twitter feed!
Data science is a highly diverse and interdisciplinary field, but does data science Twitter chatter reflect its interdisciplinary nature? Are there distinct communities of data scientists that interact with and cater to distinct sub-fields? To begin answering these questions, we will walk you through a simple analysis of a week's worth of data science-related tweets.
A data science twitter network
Tweets were collected using a tweepy listener (see here1 for a tutorial on building a Twitter listener) and stored in a text file named “data_science_twitter.txt”. Let’s first load the tweets and extract user mentions to take a quick look at the volume of data science tweets from this week.
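A minimal sketch of the loading step follows. It assumes the listener wrote one JSON-encoded tweet per line, which is a common setup but is not confirmed by the text above; the function name `load_tweets` is hypothetical.

```python
import json


def load_tweets(path):
    """Read one JSON-encoded tweet per line, skipping unusable lines.

    Assumption: the listener appended each raw tweet as a single line of JSON.
    """
    tweets = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partially written or corrupted lines
            # Stream notices (rate limits, deletions) are dicts without a "user" key.
            if isinstance(record, dict) and "user" in record:
                tweets.append(record)
    return tweets
```

Calling `load_tweets("data_science_twitter.txt")` would then return a list of tweet dictionaries ready for mention extraction.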
Tweets and network edges (links between Twitter users) were gathered based on user mentions. How many tweets and user mentions were there?
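As a rough illustration, mention edges can be pulled from each tweet's `entities.user_mentions` list, the field in which Twitter's standard tweet JSON reports mentions. The helper name `mention_edges` and the sample tweets are hypothetical, and self-mentions are dropped on the assumption that they were excluded from the counts reported here.

```python
def mention_edges(tweets):
    """Return (author, mentioned_user) pairs, excluding self-mentions."""
    edges = []
    for tweet in tweets:
        author = tweet["user"]["screen_name"]
        mentions = (tweet.get("entities") or {}).get("user_mentions", [])
        for mention in mentions:
            target = mention["screen_name"]
            # Handles are case-insensitive, so compare them in lower case.
            if target.lower() != author.lower():
                edges.append((author, target))
    return edges


# Tiny made-up sample, for illustration only.
sample_tweets = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}, {"screen_name": "carol"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"user_mentions": [{"screen_name": "Bob"}]}},  # self-mention
    {"user": {"screen_name": "carol"}, "entities": {"user_mentions": []}},
]
sample_edges = mention_edges(sample_tweets)
print(len(sample_tweets), "tweets and", len(sample_edges), "user mentions")
```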
There are 159600 tweets about data science this week, and 162070 user mentions!
The data science Twitter community is incredibly active: we saw almost 160,000 tweets within a single week! There seems to be just as much interaction within the community, as there is about the same number of user mentions, not including self-mentions.
But what does the network actually look like? To build a network and find the most influential data science Twitter users, we will use the NetworkX2 package to create a directed graph and to calculate eigenvector centrality (a measure of network influence) among the nodes (Twitter users). The resulting network is plotted using Gephi3.
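To make the influence measure concrete, here is a plain-Python power-iteration sketch of eigenvector centrality on a directed mention graph. It is not the notebook's code: NetworkX provides this calculation as `eigenvector_centrality`, and the version below only illustrates what the score means. A handle scores highly when it is mentioned by handles that themselves score highly. The function name, sample edges, and tolerance values are illustrative assumptions.

```python
import math
from collections import defaultdict


def eigenvector_centrality_sketch(edges, max_iter=1000, tol=1e-9):
    """Power iteration over in-edges: being mentioned raises your score."""
    edges = set(edges)  # treat repeated mentions as a single unweighted edge
    nodes = sorted({node for edge in edges for node in edge})
    mentioned_by = defaultdict(list)
    for source, target in edges:
        mentioned_by[target].append(source)

    score = {node: 1.0 for node in nodes}
    for _ in range(max_iter):
        # Adding the old score iterates with (A + I), which keeps the same
        # eigenvectors as A but avoids oscillation on periodic graphs.
        new = {n: score[n] + sum(score[m] for m in mentioned_by[n]) for n in nodes}
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {n: value / norm for n, value in new.items()}
        if sum(abs(new[n] - score[n]) for n in nodes) < tol * len(nodes):
            return new
        score = new
    raise RuntimeError("power iteration did not converge")


# Made-up graph: everyone mentions "hub", and "hub" mentions "a".
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"),
             ("hub", "a"), ("a", "b"), ("b", "c")]
toy_scores = eigenvector_centrality_sketch(toy_edges)
for rank, (handle, value) in enumerate(
        sorted(toy_scores.items(), key=lambda item: item[1], reverse=True), start=1):
    print(rank, handle, round(value, 4))
```

In this toy graph "hub" ranks first because every other node mentions it.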
Nodes represent Twitter handles and the edges between the nodes represent user mentions. The size and color of the nodes correspond to eigenvector centrality, which, again, is one measure of network influence. Let’s take a quick peek at the top 10 influencers (who are also plotted above):
[(1, (u'GilPress', 0.38942565243403915)), (2, (u'KirkDBorne', 0.30906334335611996)), (3, (u'Forbes', 0.23035596746895132)), (4, (u'BernardMarr', 0.21142119479688257)), (5, (u'bobehayes', 0.2072355059058224)), (6, (u'kdnuggets', 0.15597621686762647)), (7, (u'Ronald_vanLoon', 0.15518713444196847)), (8, (u'LinkedIn', 0.12561861905035457)), (9, (u'DataScienceCtrl', 0.11756733241544594)), (10, (u'BoozAllen', 0.11138358070618962))]
The top 10 influencers include some of the most respected individuals and organizations in data science, and so their influence among data scientists on Twitter is not at all surprising.
However, data science is a highly interdisciplinary field. Different communities may have different topic foci and different community influencers. For data scientists working in different sub-fields or in different spheres of data science, it is important to know who the most influential figures in the various sub-domains are, as these will be the people/handles to follow for the most up-to-date news, analyses, methods and tools. To find distinct data science communities, we will use Community4, a community detection package implemented to work on top of NetworkX. It implements the Louvain method5 for community detection.
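The Louvain method searches for a partition of the network that maximizes modularity, a score comparing the number of edges inside each community with the number expected by chance. The sketch below computes that score for a given partition using Q = Σ_c [L_c/m − (d_c/2m)²], where m is the number of edges, L_c the number of edges inside community c, and d_c the summed degree of its members. It assumes the mention graph is treated as undirected and unweighted (how the notebook handled edge direction is not shown here). It illustrates the objective only, not the Louvain search itself, and the names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    # Collapse direction and duplicate edges; ignore self-loops for simplicity.
    undirected = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    m = len(undirected)
    if m == 0:
        raise ValueError("graph has no edges")

    internal = defaultdict(int)  # edges with both ends inside a community
    degree = defaultdict(int)    # summed node degree per community
    for edge in undirected:
        u, v = tuple(edge)
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (made-up example).
toy_graph = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
print(modularity(toy_graph, two_groups), modularity(toy_graph, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping all nodes together, which is the kind of improvement a modularity-maximizing method looks for.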
1234 distinct communities were detected
Here are the top 10 most populous communities:
Community 25 has 2883 members
Community 3 has 2841 members
Community 11 has 2564 members
Community 22 has 2027 members
Community 13 has 1629 members
Community 17 has 1619 members
Community 39 has 1442 members
Community 38 has 822 members
Community 19 has 785 members
Community 45 has 776 members
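A listing like the one above can be produced from a mapping of handle to community ID with a simple counter. The sketch assumes the detection step yields such a dictionary; the name `largest_communities` and the toy partition are illustrative.

```python
from collections import Counter


def largest_communities(community_of, top_n=10):
    """Return (community_id, member_count) pairs, largest first."""
    return Counter(community_of.values()).most_common(top_n)


# Made-up partition: handle -> community ID.
toy_partition = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 0, "f": 1}
print("%d distinct communities were detected" % len(set(toy_partition.values())))
for community_id, size in largest_communities(toy_partition, top_n=2):
    print("Community %d has %d members" % (community_id, size))
```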
Chatter among data science communities
We see that there are a number of highly populous communities detected in the larger network, and many more communities that are smaller in size. Let’s take a quick look at a few of the most populous communities. We will look to see who the most influential users are among each of the interrogated communities, and try to find popular topics that the community focuses on using topic modeling. Our analyses will focus on communities 11, 13 and 38.
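One way to isolate a single community for inspection is to keep only the mention edges whose two endpoints both belong to it. The sketch assumes a list of (author, mentioned_user) edges and a handle-to-community mapping, both hypothetical structures here. Whether influence was then re-scored within the sub-network or taken from the full network is not stated in this write-up.

```python
def community_subnetwork(edges, community_of, community_id):
    """Keep only the edges whose two endpoints are in the given community."""
    members = {handle for handle, cid in community_of.items() if cid == community_id}
    return [(source, target) for source, target in edges
            if source in members and target in members]


# Made-up example: "c" -> "d" crosses communities and is dropped.
demo_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
demo_partition = {"a": 11, "b": 11, "c": 11, "d": 13, "e": 13}
print(community_subnetwork(demo_edges, demo_partition, 11))
```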
Let’s start by visually inspecting the sub-network associated with community 11:
We see a number of influential handles in this subnetwork, but the top 5 are:
But what is this data science community talking about? To take a quick look at the types of topics that this community is interested in, we will use the topic modeling package Topik6 from Continuum. Topik gives a high-level interface to wildly popular topic modeling libraries in Python.
First we want to set up a directory structure for Topik to read from. We make each Twitter user in the community a document that Topik can read:
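A minimal sketch of this step writes one text file per user, containing all of that user's tweets. The one-file-per-user layout, the `.txt` extension, and the function name are assumptions rather than the notebook's actual layout.

```python
import os
from collections import defaultdict


def write_user_documents(tweets, members, out_dir):
    """Write one text file per community member holding that user's tweets."""
    texts = defaultdict(list)
    for tweet in tweets:
        handle = tweet["user"]["screen_name"]
        if handle in members:
            texts[handle].append(tweet.get("text", ""))

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for handle, lines in sorted(texts.items()):
        # Handles contain only letters, digits and underscores, so they are
        # safe to use as file names.
        path = os.path.join(out_dir, handle + ".txt")
        with open(path, "w", encoding="utf-8") as handle_file:
            handle_file.write("\n".join(lines))
        paths.append(path)
    return paths
```

The resulting directory can then be handed to whichever reader the topic modeling library expects.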
Let’s now build a topic model for community 11 and visualize the result. Topik enables us to do so in a very streamlined way. We will simply tokenize the data, input a list of stop words and the number of topics to search for, then build the model and visualize the results using Topik.
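Topik's own calls are not reproduced here, since its exact API does not appear in this write-up. As a library-free illustration of the preprocessing described above, the following sketch tokenizes tweet text and removes stop words. The regular expressions, the stop-word list, and the function name are all assumptions.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
TOKEN = re.compile(r"[a-z][a-z0-9_]+")  # words of two or more characters


def tokenize(text, stop_words):
    """Lower-case the text, drop links, and filter out stop words."""
    text = URL.sub(" ", text.lower())
    return [token for token in TOKEN.findall(text) if token not in stop_words]


# Hypothetical stop words; a real list would be much longer.
demo_stop_words = {"rt", "for", "the", "and"}
demo_tokens = tokenize(
    "RT @kdnuggets: Deep learning for #DataScience https://t.co/abc",
    demo_stop_words,
)
print(Counter(demo_tokens).most_common(3))
```

Token counts like these are the raw material a topic model works from. The number of topics is a separate parameter chosen when the model is built.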
The termite plot7 is a nice way to visualize topic modeling results. The x axis lists the topic numbers and the y axis lists frequent topic words. The size of the circle corresponds to the frequency of that word with respect to a topic. The termite plot for community 11 seems to show us that the Twitter chatter for this community includes broad data science topics like machine intelligence, analytics and data mining, but also includes a substantial amount of chatter about data science-related blogs, blog posts or stories, as well as data science conferences such as ODSC Boston, tutorials, online classes and careers.
This community seems to reflect the general data science community, but it also includes Twitter handles that are influential community builders and routinely tweet about data science blogging, reporting and other community-focused topics, such as training, conferences and careers. It is no surprise, then, that DataScienceCtrl and kdnuggets are among the most influential handles in this network. Not only are kdnuggets and DataScienceCtrl regularly the most active and respected sources of data science news and blog postings, but BernardMarr, EvanSinar and Datafloq are all highly respected and influential in the broader data science community.
If we look at the next community, community 13 (above) we see that the most influential handles include
The topic modeling and termite plot for this network seem to show a focus on enterprise business analytics, industry applications, data science for social good and disruptive companies. This is consistent with the top influencers in this community. BoozAllen, LaurenNealPhD and petrguerra are all Booz Allen associated accounts, with LaurenNealPhD and petrguerra being a senior associate and VP at Booz Allen, respectively. Interestingly, Booz Allen and Kaggle co-sponsored the Data Science Bowl this year, and we do indeed see a signature of the Data Science Bowl in this community, with topics including terms related to data science for social good. kaggle and wendykan, a data scientist at Kaggle, are also among the top influencers of this community.
Lastly, if we take a look at community 38 (above) we see that the influential handles include
CloudExpo and ThingsExpo are both Twitter handles devoted to cloud and IoT meetings.
This last community is enriched for chatter related to IoT, cloud business analytics, online courses, networks and security. This nicely reflects the fact that the most influential Twitter handles are related to cloud and IoT conferences and to open source analytics/software.
We found that the data science Twitter community is incredibly active and interactive, and that several important, distinct communities exist. The individual influencers for each community are experts in the subdomain or are otherwise highly regarded in the data science community. With Topik, topic modeling and the visualization of the results were incredibly streamlined, and we can gain a great deal of insight from collecting and analyzing even a single week of data science Twitter chatter.
Go here for the Jupyter Notebook.