Much of our machine learning capability comes from structured data, but the real payload lies in the messy, unstructured data underneath. If we want practical insights, machines have to learn to parse things like social media posts filled with misspellings or sarcasm, or handwritten doctor’s notes with illegible lettering. That is the job we hand to language models. So how do machines do this? Alex Peattie, the co-founder of PEG, has thoughts on where we’ve been with language models in the past and how they may help machines decipher these difficulties.
Origins of Language Models
Roughly 80 years ago, Alan Turing and a group of brilliant minds gathered to break the ciphers produced by the Enigma machine, work that helped win the war and began the journey to unlocking one of our most persistent problems: how to define a language with data.
The machines they built were essentially early computers, able to comb through data to find the secret key that would take the Allies from gibberish to real Axis transmissions. The concept was challenging, and Turing was doing this on some of the earliest computing machines.
Short story: it was incredibly difficult. Even if they had guessed the secret key, they might not know it was right because of noise, typos, and transmission errors.
Turing produced a novel approach. Building on the ideas of Markov (of Markov models), he realized that it’s all about probability. Instead of asking whether a candidate message was exactly right, the question became whether it had a “high likelihood” or “low likelihood” of being real language, and a language model built this way produced fewer false negatives.
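The likelihood idea can be sketched with a character-level Markov model. This is a minimal illustration, not a reconstruction of the wartime method: the training sentence, the two candidate decryptions, and the names `CharBigramModel` and `most_plausible` are all invented for this example, and add-one smoothing is an assumption that keeps unseen letter pairs from scoring zero.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Tiny invented training text; a real model would use a large corpus.
SAMPLE_TEXT = (
    "the weather is calm and the convoy sails at dawn "
    "the weather at sea is calm"
)


class CharBigramModel:
    """Character-level Markov model with add-one smoothing."""

    def __init__(self, text):
        cleaned = self._clean(text)
        # How often each character pair occurs, e.g. ("t", "h").
        self.pair_counts = Counter(zip(cleaned, cleaned[1:]))
        # How often each character appears as the first half of a pair.
        self.first_counts = Counter(cleaned[:-1])

    @staticmethod
    def _clean(text):
        # Lowercase and drop anything outside the model's alphabet.
        return "".join(ch for ch in text.lower() if ch in ALPHABET)

    def log_likelihood(self, text):
        """Sum of log P(next char | current char) over the text."""
        cleaned = self._clean(text)
        total = 0.0
        for a, b in zip(cleaned, cleaned[1:]):
            numerator = self.pair_counts[(a, b)] + 1
            denominator = self.first_counts[a] + len(ALPHABET)
            total += math.log(numerator / denominator)
        return total


def most_plausible(model, candidates):
    """Return the candidate the model scores as most likely."""
    return max(candidates, key=model.log_likelihood)


model = CharBigramModel(SAMPLE_TEXT)
candidates = [
    "xqzkjpbqfgzxkjqpbfuxzgkq",  # stand-in for a wrong-key decryption
    "the convoy sails at dawn",  # stand-in for a right-key decryption
]
best = most_plausible(model, candidates)
print(best)
```

Because the model only ever returns a score, a message with a typo or two still scores far above gibberish, which is what makes the approach tolerant of noise.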
So how does this relate to language models now? Even now, we’re feeding the model a text and asking for a likelihood. We often run several language models over the same text and use whichever assigns the highest probability to decide what the language actually is.
What Is Unstructured Data?
Structured data is anything you can represent in table form. Neat columns. Filled-in rows. It’s fundamentally clean and can be quickly processed.
However, there’s tons of data that doesn’t fit neatly into this structure: things like social media posts, handwritten notes, and partial notations. Around 90% of an organization’s data is going to be unstructured, so you’re missing a considerable quantity of insight if you’re focusing only on neatly filled data.
Businesses can apply language models to unstructured data to reveal potentially revolutionary insights. Data scientists can examine this data to gain more interesting and, here’s the key, more actionable insight than classic structured data provides.
Types of Language Models
Turing’s insight led to a series of language models that we can use today, often several of them together. Turing was focused on classifying language in a way that could offer insights into meaning and context.
- Count based: Sometimes called statistical, these models are fast to train and were popular during the 1980s and 1990s. For example, “bag of words” models, such as those used to classify spam text messages, are great for simple tasks. N-gram models pick up where bag of words leaves off by keeping short runs of word order, which better identifies context.
- Continuous space: Arriving in the 2000s and 2010s, these models are slower to train, but they offer state-of-the-art performance. This method uses word vectors to account for concepts that are worded differently but share the same essence. Rather than working out relationships through explicit counting, neural networks learn them, which offers an easier type of classification.
- Class of 2018: Transfer learning using a very powerful general-purpose language model. This approach appeared only this year (2018) and offers unprecedented insight into unstructured data, regularly matching or outperforming the previous state-of-the-art models.
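The first two families in the list above can be illustrated in a few lines. This is a rough sketch under stated assumptions: the helper names `bag_of_words`, `ngrams`, and `cosine_similarity` are hypothetical, and the two-dimensional `toy_vectors` are made up by hand to show the geometry, whereas a real continuous-space model would learn its vectors from data.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count words, ignoring their order."""
    return Counter(text.lower().split())


def ngrams(text, n=2):
    """Count runs of n consecutive words, which keeps local word order."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Count based: the two sentences share a bag of words but not their bigrams.
first, second = "dog bites man", "man bites dog"
same_bag = bag_of_words(first) == bag_of_words(second)
same_bigrams = ngrams(first) == ngrams(second)

# Continuous space: hand-made 2-d vectors, NOT learned embeddings.
toy_vectors = {
    "good": [0.9, 0.1],
    "great": [0.8, 0.2],
    "awful": [-0.9, 0.1],
}
close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
far = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])

print(same_bag, same_bigrams)
print(close > far)
```

The bag of words cannot tell who bit whom, while the bigram counts can. The vector comparison shows how different words with a similar sense can end up close together, something a pure counting model has no way to express.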
Let’s walk through a few case studies using these models with unstructured data.
Marvel needed to analyze the trailer for a new show, “Inhumans.” You could look at simple structured data (number of views, like counts, etc.), but there’s nothing actionable there. Instead, Peattie examined the comments left on the trailer through sentiment analysis, using positive models and negative models.
We can use our models to infer the strength of sentiments. Sentiment gradients revealed that while plenty of people liked the trailer, they weren’t nearly as adamant as those who hated it. Peattie was also able to glean what it was that people disliked, giving the trailer creators actionable advice for a second trailer.
Key takeaway: richer insights
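One way to get a sentiment strength out of a positive model and a negative model is to compare how likely each finds a comment. The sketch below is an assumption about how that could work, not the method used in the talk: the comments are invented, `UnigramModel` and `sentiment_score` are hypothetical names, and the score is simply the difference of the two log-likelihoods, so the sign gives the direction and the size gives the strength.

```python
import math
from collections import Counter

# Invented comments for illustration only, not data from the talk.
POSITIVE_COMMENTS = ["looks good", "really good trailer", "good cast"]
NEGATIVE_COMMENTS = ["looks terrible", "really terrible wigs", "terrible cast"]


def tokenize(text):
    return text.lower().split()


class UnigramModel:
    """Word-frequency language model with add-one smoothing."""

    def __init__(self, comments, vocabulary):
        self.counts = Counter(w for c in comments for w in tokenize(c))
        self.total = sum(self.counts.values())
        self.vocabulary = vocabulary

    def log_likelihood(self, text):
        denominator = self.total + len(self.vocabulary)
        # Words outside the shared vocabulary are skipped.
        return sum(
            math.log((self.counts[w] + 1) / denominator)
            for w in tokenize(text)
            if w in self.vocabulary
        )


vocabulary = {
    w for c in POSITIVE_COMMENTS + NEGATIVE_COMMENTS for w in tokenize(c)
}
positive_model = UnigramModel(POSITIVE_COMMENTS, vocabulary)
negative_model = UnigramModel(NEGATIVE_COMMENTS, vocabulary)


def sentiment_score(text):
    """Positive values lean positive; magnitude reflects strength."""
    return positive_model.log_likelihood(text) - negative_model.log_likelihood(text)


mild_praise = sentiment_score("looks good")
strong_dislike = sentiment_score("terrible terrible wigs")
print(mild_praise, strong_dislike)
```

With these toy comments, the dislike example lands further from zero than the praise example, which mirrors the pattern described above: the people who liked the trailer were less adamant than the people who hated it.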
In a second, hypothetical scenario, a business that’s been around for a few years doesn’t know much about its customers. Analyzing active social media accounts could allow the company to infer characteristics based on language use.
Theoretically, the content of tweets could reveal gender up to 82% of the time and both gender and age bracket 52% of the time. You could apply this learning to build a profile of the existing customer base, allowing a business to identify more closely who is interacting and who isn’t.
Key takeaway: Post-hoc analysis (basically, data catch-up)
Statin Decline Study
Cardiovascular disease kills about 18 million people every year, and one of the most popular treatments is the prescription of statins. Some people just choose not to take them despite their efficacy. A team compared statin decline rates at different hospitals to learn more about why people make this decision.
The problem? This data didn’t exist. Instead, they mined the data from doctors’ notes. Every doctor has a different style, but the team was still able to achieve 90% accuracy in their unstructured data processing.
Key takeaway: cheaper and easier than mining structured datasets.
The Future Of Unstructured Data
Unstructured data is powerful and plentiful, and as our language models get better, we should see improvements in how we work with this often tricky subfield. Working through the noise gives us rich insight previously unavailable through structured data pulls, and the improvements could eventually provide a cheaper alternative to structured datasets.
Be sure to click through to Alex Peattie’s insightful ODSC Europe 2018 presentation for a step-by-step guide to building the two main kinds of language models and recommendations for software to work with unstructured data more effectively.