Social media has fundamentally changed the way we interact with each other and with the World Wide Web. Our web activities are now inherently social. We can keep in touch with close friends on Facebook without ever needing to pick up a phone or get on a train. We can use Twitter to share our thoughts with the world, and we can explore someone else’s by scrolling through Pinterest boards. At the same time, the social web is diffuse, sprawling and unorganized. Hashtags are one way to manage the chaos.
In general, hashtags are simple phrases or words preceded by a pound sign (#). They tag a picture, post or message as being related to a specific or popular topic. These tags make information more accessible to a wider audience and, if used correctly, they can help grow our digital social circles. But how should I pick the best hashtag? The best hashtags are the ones most relevant to the post, and the ones that will reach the largest number of viewers. In this post, we explore one way to pick the ‘best’ hashtag(s) for your tweets, using Python and a little network analysis.
To start, let’s think about sending out a tweet about a really cool new data science article. I could just use the #datascience hashtag, but let’s take a closer look at the available options. To do that, we will use a week’s worth of data science related tweets that were collected for a previous post, and visualize hashtag frequency using wordcloud, a very nice Python library for plotting word clouds.
First, let’s load in our tweets and extract hashtags from all of the postings.
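The loading code is not reproduced here, so the following is only a minimal sketch of the extraction step. It assumes the tweets are already available as plain text strings (the `tweets` list is made-up sample data) and uses a simple regular expression that only approximates how Twitter delimits hashtags. The helper name `extract_hashtags` is hypothetical.

```python
import re

# Approximate pattern: a '#' followed by word characters (letters, digits, underscore).
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in a tweet, upper-cased and without the leading '#'."""
    return [tag.upper() for tag in HASHTAG_RE.findall(text)]


# Made-up sample tweets standing in for the collected data.
tweets = [
    "Great read on #DataScience and #BigData",
    "New #bigdata tools for #IoT",
    "No tags here",
]

# Flatten the per-tweet lists into one list of hashtags.
hashtags = [tag for tweet in tweets for tag in extract_hashtags(tweet)]
print(hashtags)
```

Upper-casing the tags means that #bigdata and #BigData are counted as the same hashtag, which matches how the tags are displayed in the rest of this post.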
With tweets loaded and hashtags extracted, we can go ahead and plot them using wordcloud. The wordcloud library lets the user define custom coloring functions, and it allows custom images to be used as masks for plotting. For this example, I will make a function that randomly selects one of three red colors and use a web image of a cartoon brain as an image mask.
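A sketch of what that could look like is below. The three hex colors, the function names and the file paths are assumptions made for illustration, not the values used for the original plot. The plotting function needs the wordcloud, numpy and Pillow packages, which are imported inside it so that the color function can be tried on its own.

```python
import random

# Three shades of red; the exact values are an arbitrary choice for this sketch.
REDS = ["#8b0000", "#cc0000", "#ff4d4d"]


def random_red(word=None, font_size=None, position=None, orientation=None,
               random_state=None, **kwargs):
    """Color function in the form wordcloud calls it: return one red at random."""
    rng = random_state if random_state is not None else random
    return rng.choice(REDS)


def plot_hashtag_cloud(frequencies, mask_path, out_path="hashtag_cloud.png"):
    """Draw a masked word cloud from a {hashtag: count} mapping and save it."""
    # Imported lazily so random_red can be used without these packages installed.
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    mask = np.array(Image.open(mask_path))  # pure white areas of the image stay empty
    cloud = WordCloud(background_color="white", mask=mask, color_func=random_red)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
    return cloud
```

A call such as `plot_hashtag_cloud({"BIGDATA": 120, "IOT": 30}, "brain.png")` would then write the image to disk, where `brain.png` is a placeholder for whatever mask image is used.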
The size of the text is a proxy for its frequency. There are a large number of hashtags in data science related tweets, but “BIGDATA”, “ANALYTICS” and “DATASCIENCE” are the most popular. Let’s take a look at the top 10 based on number of occurrences.
The MACHINELEARNING hashtag was seen about 15211 times
The DATASECURITY hashtag was seen about 2673 times
The AI hashtag was seen about 3461 times
The IOT hashtag was seen about 13503 times
The BIGDATA hashtag was seen about 115575 times
The ANALYTICS hashtag was seen about 16333 times
The SECURITY hashtag was seen about 3518 times
The DATA hashtag was seen about 5681 times
The DATASCIENCE hashtag was seen about 24589 times
The CLOUD hashtag was seen about 3645 times
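Counts like the ones above can be produced with `collections.Counter` from the standard library. This is a minimal sketch on a made-up list of tags, not the original notebook code.

```python
from collections import Counter

# Made-up hashtag list standing in for the tags extracted from the week of tweets.
hashtags = ["BIGDATA", "DATASCIENCE", "BIGDATA", "IOT", "BIGDATA", "DATASCIENCE"]

counts = Counter(hashtags)
top_tags = counts.most_common(10)  # (hashtag, count) pairs, most frequent first

for tag, n in top_tags:
    print(f"The {tag} hashtag was seen about {n} times")
```

Note that `most_common` returns the tags sorted by count, whereas the listing above is not in sorted order.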
The “BIGDATA” hashtag was seen over 100,000 times. Impressive. Should I stop here, pick a random sampling from the top 10 hashtags and be on my way? It might not be a terrible strategy, but there is a better one. Twitter is a highly dynamic platform where the top hashtag(s) can change within the hour and even from minute to minute. As a simple exploration of this notion, let’s break our week’s worth of tweets into hourly increments and plot the rank of the top 10 hashtags relative to each other.
The result, of course, will be much simpler than what actually happens on Twitter, as we will only see what happens to the weekly top 10 with respect to each other. We will not, for example, be able to see the hashtags that rise into the top 10 during each individual time period.
To start, we will iterate through each hour and see the relative ranks among the top 10 hashtags.
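One way to do this with only the standard library is sketched below. It assumes each tweet has been reduced to a `(timestamp, hashtags)` pair; the function name `hourly_ranks`, the alphabetical tie-breaking rule and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_ranks(tagged_tweets, top_tags):
    """Rank `top_tags` against each other within each hour.

    tagged_tweets: iterable of (datetime, [hashtags]) pairs.
    Returns {hour_start: {tag: rank}}, where rank 1 is the most frequent tag.
    """
    per_hour = defaultdict(Counter)
    for created_at, tags in tagged_tweets:
        # Truncate the timestamp to the start of its hour.
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        per_hour[hour].update(t for t in tags if t in top_tags)

    ranks = {}
    for hour in sorted(per_hour):
        counts = per_hour[hour]
        # Sort by descending count; ties are broken alphabetically.
        ordered = sorted(top_tags, key=lambda t: (-counts[t], t))
        ranks[hour] = {tag: i + 1 for i, tag in enumerate(ordered)}
    return ranks


# Made-up sample data covering two hours.
sample = [
    (datetime(2016, 1, 1, 9, 5), ["BIGDATA", "IOT"]),
    (datetime(2016, 1, 1, 9, 40), ["BIGDATA"]),
    (datetime(2016, 1, 1, 10, 15), ["IOT", "AI"]),
    (datetime(2016, 1, 1, 10, 50), ["IOT"]),
]
ranks = hourly_ranks(sample, ["BIGDATA", "IOT", "AI"])
```

Hours in which no tweets were collected do not appear in the result, so a plot built from it would show a gap rather than a rank for those hours.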
After gathering the relative ranks for each hour, we will plot them using plotly.
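The plotting code could look roughly like the sketch below, which draws one line per hashtag with rank 1 at the top. The input format (`{hour: {tag: rank}}`), the function names and the sample data are assumptions. The plotly import sits inside `plot_ranks` so that the trace-building step can be run without plotly installed.

```python
def rank_traces(ranks):
    """Turn {hour: {tag: rank}} into one line trace (a plain dict) per hashtag."""
    hours = sorted(ranks)
    tags = sorted({tag for by_tag in ranks.values() for tag in by_tag})
    return [
        {
            "type": "scatter",
            "mode": "lines+markers",
            "name": tag,
            "x": hours,
            "y": [ranks[h].get(tag) for h in hours],
        }
        for tag in tags
    ]


def plot_ranks(ranks):
    """Show the hourly ranks as a line chart (requires the plotly package)."""
    import plotly.graph_objects as go

    fig = go.Figure(data=rank_traces(ranks))
    fig.update_layout(
        title="Relative rank of the top hashtags per hour",
        xaxis_title="Hour",
        yaxis_title="Relative rank",
        yaxis_autorange="reversed",  # put rank 1 at the top of the chart
    )
    fig.show()


# Made-up ranks for two hours, keyed by hour index.
sample_ranks = {
    0: {"BIGDATA": 1, "IOT": 2, "AI": 3},
    1: {"BIGDATA": 1, "IOT": 3, "AI": 2},
}
traces = rank_traces(sample_ranks)
```

Calling `plot_ranks(sample_ranks)` would open the interactive chart.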
The “BIGDATA” hashtag is reliably the top tag among the 10, even in hourly increments, but the relative rank of the other tags varies from hour to hour. This suggests that in order to maximize the utility of our hashtag we’ll need a real-time indicator. We will also need a way to narrow down our hashtag selection, since Twitter only allows the use of 140 characters.
To satisfy both of these goals, we can use tweepy to collect tweets in real time and use NetworkX to build a hashtag network in order to find the most influential hashtags within a specified time window. Once hashtag networks are built, eigenvector centrality will be used to rank and plot hashtags according to their influence in real time. Using a centrality measure to pick tags can achieve, with one or two tags, the same viewership that would be obtained by overloading a tweet with several tags. Choosing a hashtag based on a simple frequency value does not always offer this benefit.
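To make the ranking step concrete, here is a sketch of eigenvector centrality on a hashtag co-occurrence network, where two hashtags are linked whenever they appear in the same tweet. The power iteration is written out in plain Python to show what the measure computes; NetworkX provides an `eigenvector_centrality` function for graph objects. The co-occurrence definition of the network, the function names and the sample window are assumptions, and the real-time collection with tweepy is not shown.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_weights(tweet_tags):
    """Count how often each pair of hashtags appears in the same tweet."""
    weights = Counter()
    for tags in tweet_tags:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights


def eigenvector_centrality(weights, max_iter=200, tol=1e-8):
    """Power iteration on the weighted, undirected co-occurrence network."""
    neighbors = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w
    if not neighbors:
        return {}

    score = {node: 1.0 / len(neighbors) for node in neighbors}
    for _ in range(max_iter):
        previous = score
        # Start from the previous scores (a shift that helps convergence),
        # then add each neighbor's weighted score.
        score = dict(previous)
        for node, nbrs in neighbors.items():
            for nbr, w in nbrs.items():
                score[nbr] += previous[node] * w
        # Rescale so the scores have unit Euclidean length.
        norm = sqrt(sum(v * v for v in score.values())) or 1.0
        score = {node: v / norm for node, v in score.items()}
        if sum(abs(score[n] - previous[n]) for n in score) < len(score) * tol:
            break
    return score  # best estimate, even if the tolerance was not reached


# Made-up hashtags from the tweets seen in one time window.
window = [
    ["BIGDATA", "DATASCIENCE"],
    ["BIGDATA", "IOT"],
    ["BIGDATA", "AI"],
    ["DATASCIENCE", "AI"],
]
scores = eigenvector_centrality(cooccurrence_weights(window))
best = max(scores, key=scores.get)
print(best)
```

A hashtag scores highly here when it co-occurs with other well-connected hashtags, which is why this ranking can differ from a plain frequency count.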
The result is a constantly updating wordcloud plot indicating the best hashtags. Picking the best hashtag will depend on when you want to send out your tweet! The above plot is composed of snapshots of data science tweets within 2-minute time windows taken over the course of the day. The top weekly hashtags are still important, but many other tags make timely appearances.
Hashtags are a powerful way to reach a broad audience and contribute to larger discussions on various social media platforms. A strong social media presence can make or break a digital campaign for many consumer-facing companies. The digital and social media marketing arms of most companies thrive on consumer interactions on Twitter and other sites. Efficient hashtag choice can improve visibility in the short and long term for digital brands. Using Python, tweepy, NetworkX and wordcloud, we were able to develop a simple method for picking the best hashtag using network analysis and wordcloud visualizations.
See the Jupyter notebook here.
#ODSC ©2016