The meaning of Artificial Intelligence (A.I.) changes depending on who is speaking. Right now the most prominent instantiation of A.I. is the chatbot. Technology’s biggest companies and plucky startups alike allocate resources to making chatbots more impressive, but chatbots have not really broken into the mainstream yet. From the outside, the infrastructure behind such an undertaking can look complex, but a simple instantiation is quite accessible. Let’s look at the process of building a Twitter bot that talks to itself.

As with all of Data Science, data is the most important resource. For this bot I used a corpus built from five philosophy books available on Project Gutenberg. They were:

  1. Beyond Good and Evil by Friedrich Nietzsche
  2. The Analysis of Mind by Bertrand Russell
  3. The Critique of Pure Reason by Immanuel Kant
  4. The Prince by Machiavelli
  5. The Republic by Plato

Usually, preparing these documents would involve a decent amount of pre-processing, but the main component of my workflow, the markovify package, shrank this step considerably. All I needed to do was remove the preamble and postamble that Project Gutenberg wraps around each document, since they are not part of the text. Even this step was mostly automated, as I took advantage of the Guten-gutter package to do the cleaning. From there it was time to create a Markov chain of the corpus using markovify and store it for later use. This setup reduced the process to the eighteen lines below.
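As a rough sketch of that process (not the exact listing), assuming the books have already been cleaned and saved as .txt files, the model can be built and stored along these lines. The helper names, the `cleaned_books` directory, and the `philosophy_model.json` file are hypothetical; `markovify.Text`, `to_json`, and `from_json` are the library calls doing the real work.

```python
from pathlib import Path


def read_corpus(corpus_dir):
    """Concatenate every cleaned .txt file in corpus_dir into one string."""
    paths = sorted(Path(corpus_dir).glob("*.txt"))
    return "\n".join(p.read_text(encoding="utf-8") for p in paths)


def build_model(text, state_size=2):
    """Build a markovify text model (a state size of 2 is the library default)."""
    # Imported here so the file helper above works without the dependency.
    import markovify  # third-party: pip install markovify

    return markovify.Text(text, state_size=state_size)


def save_model(model, path):
    """Store the model as JSON so the bot can reload it without rebuilding."""
    Path(path).write_text(model.to_json(), encoding="utf-8")


def load_model(path):
    """Rebuild a model from the JSON written by save_model."""
    import markovify

    return markovify.Text.from_json(Path(path).read_text(encoding="utf-8"))


# Example usage (directory and file names are hypothetical):
#   model = build_model(read_corpus("cleaned_books"))
#   save_model(model, "philosophy_model.json")
```

Storing the model as JSON means the deployed bot only has to load a file rather than re-read all five books on every run.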

Given the hidden complexity behind this code snippet, a brief digression into the theoretical underpinning would not be amiss. In a nutshell, the foundation of the markovify package is its transformation of the corpus into transition probabilities between words. Let’s say the corpus consisted of only three sentences: 1) I am well today, 2) I am fine now, and 3) I will be well. The word ‘I’ has a roughly 67% chance of being followed by ‘am’, and a roughly 33% chance of being followed by ‘will’. For a state size of one, these probabilities would be calculated for each unique word in the corpus. The resulting transition matrix defines a Markov chain, and each of its rows necessarily sums to 1. However, the state size need not be one. Markovify’s default state size, which is also the one used for this bot, is two. In this case, each state is a pair of consecutive words, and the library calculates the probability of each word that can follow that pair. For example, the phrase ‘I am’ has a 50% chance of being followed by ‘well’ and a 50% chance of being followed by ‘fine’. This is the essence of how the bot will create new sentences.
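To make the three-sentence example concrete, here is a minimal pure-Python sketch of the counting step. It is a simplification and not markovify’s actual implementation (which, among other things, also tracks where sentences begin and end); the function name `transition_probabilities` is made up for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sentences, state_size=1):
    """Map each state (a tuple of state_size words) to {next_word: probability}."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        # Slide a window over the sentence: state -> the word that follows it.
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            counts[state][words[i + state_size]] += 1
    # Normalise the counts so that every row sums to 1.
    return {
        state: {word: n / sum(followers.values()) for word, n in followers.items()}
        for state, followers in counts.items()
    }


corpus = ["I am well today", "I am fine now", "I will be well"]
one_word = transition_probabilities(corpus, state_size=1)
two_word = transition_probabilities(corpus, state_size=2)

print(one_word[("I",)])       # 'am' about 0.67, 'will' about 0.33
print(two_word[("I", "am")])  # 'well' 0.5, 'fine' 0.5
```

Generating a sentence then amounts to picking a starting state and repeatedly sampling the next word according to these probabilities.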

From there the final step was to transfer my Markov chain into the basic shell of my bot and deploy it. The section of this tutorial before the “Heroku” section provides an easy guide to setting up a Twitter application and making a post. My code was slightly different to suit my use case:

settings_tweet.py contained the credentials for my Twitter application. The bot produces a sentence of at most 140 characters on line 15. Running the script above multiple times is enough to produce tweets, but automating it is much better. The Heroku tutorial linked to above showed one method. Instead, I transferred the files to an AWS EC2 instance and wrote a cron job to have the bot tweet three times a day.
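The tweet-generation step can be sketched as follows. This is an illustration under assumptions rather than my exact script: `compose_tweet` and `tweet_once` are hypothetical names, and `post` is a stand-in for whatever call your Twitter client library uses to publish a status. The only markovify call involved is `make_short_sentence`, which returns `None` when it cannot produce a sentence within the character limit.

```python
def compose_tweet(model, max_chars=140, attempts=10):
    """Return a generated sentence of at most max_chars characters, or None."""
    for _ in range(attempts):
        # make_short_sentence may return None, so try a few times.
        sentence = model.make_short_sentence(max_chars)
        if sentence is not None:
            return sentence
    return None


def tweet_once(model, post):
    """Generate one sentence and pass it to `post`; return the text, or None."""
    text = compose_tweet(model)
    if text is not None:
        # `post` is a placeholder for the client library's status-update call.
        post(text)
    return text
```

A script that loads the stored model and calls `tweet_once` a single time is then all the scheduler needs to run, since each invocation produces at most one tweet.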

You can find the bot here.


©ODSC 2016

Gordon Fleetwood

Gordon studied Math before immersing himself in Data Science. Originally a die-hard Python user, R's tidyverse ecosystem gradually subsumed his workflow until only scikit-learn remained untouched. He is fascinated by the elegance of robust data-driven decision making in all areas of life, and is currently involved in applying these techniques to the EdTech space.