Watch: Understanding Unstructured Data with Language Models

As data scientists, we’ve seen rapid improvement over the last few decades in the tools available for working with structured data (tabular data, graph data, sensor data, and so on). Yet the vast majority of our data (Merrill Lynch puts the figure at roughly 90%) is *unstructured* and lives in documents, emails, reviews, reports, chat logs, and the like. Many of us are far less familiar with how to analyze and understand this trove of unstructured data.

This talk by Alex Peattie focuses on language models, one of the most fundamental tools for working with unstructured data. Language models are all around us (although we’re probably unaware of them), underpinning everything from Word’s spellchecker to home assistants like Alexa. While plenty of “out of the box” language modeling libraries exist, the first part of the talk focuses on building a thorough understanding of what a language model is and how it works. We touch on key ideas from statistics and information theory, and see how Alan Turing, in developing techniques to break Nazi codes at Bletchley Park, created smoothing techniques that remain widely used in language models today. We then move to the present day, looking at how techniques like word vectors and transfer learning have yielded an improved generation of tools. In the second half of the talk, we look at how we can use language models in practice to understand unstructured data.
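To make the idea of a language model concrete, here is a minimal sketch of a bigram model that assigns a log-probability to a sentence. It is an illustration only, not code from the talk. Note that it uses add-one (Laplace) smoothing, a simpler stand-in for the Turing-era smoothing mentioned above, and that the function names and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams over a list of tokenized sentences."""
    context_counts = Counter()  # how often each token appears as a context
    bigram_counts = Counter()   # how often each (previous, next) pair appears
    vocab = set()               # tokens that can be predicted, incl. "</s>"
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded[1:])
        context_counts.update(padded[:-1])
        bigram_counts.update(zip(padded, padded[1:]))
    return context_counts, bigram_counts, vocab


def log_prob(tokens, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    context_counts, bigram_counts, vocab = model
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing: unseen bigrams get a small non-zero probability.
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


# Hypothetical toy corpus, for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram_model(corpus)
print(log_prob(["the", "cat", "sat"], model))
print(log_prob(["sat", "cat", "the"], model))
```

A word order seen in training scores higher than a scrambled one, and smoothing keeps the score finite even for word pairs the model has never seen. Words outside the training vocabulary are handled only crudely here; a real model would reserve an explicit unknown-word token.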


Specifically, this video explores:

– Classification: the canonical application of language models, which can help us identify spam, analyze sentiment, or perform unsupervised clustering. We look at a famous case in which language models successfully identified a Shakespeare forgery.

– Predictive modeling: if I were to look at your tweets (and nothing else), could I guess your gender? It turns out state-of-the-art techniques can predict it correctly more than 80% of the time. We look at how language models can enrich your datasets with additional demographic or contextual data.

– Information retrieval: finally, we see how language models have been used extensively, for example in the legal sector, to extract targeted insights from enormous data sets.
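As a rough illustration of the classification use case in the list above, the sketch below trains one unigram language model per class and assigns a new document to the class whose model gives it the highest likelihood. This is a simplified, assumed approach (add-one smoothing, equal class priors), not the method used in the talk, and the labels, function names, and example messages are all hypothetical.

```python
import math
from collections import Counter


def train_class_models(labeled_docs):
    """Build per-label word counts and a shared vocabulary.

    labeled_docs is an iterable of (label, tokens) pairs.
    """
    counts = {}
    vocab = set()
    for label, tokens in labeled_docs:
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, vocab


def classify(tokens, counts, vocab):
    """Return the label whose unigram model scores the tokens highest."""
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        # Add-one smoothed log-likelihood; class priors are assumed equal.
        score = sum(
            math.log((word_counts[w] + 1) / (total + len(vocab)))
            for w in tokens
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages, for illustration only.
training = [
    ("spam", ["win", "cash", "now"]),
    ("spam", ["free", "cash", "prize"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "at", "noon"]),
]
counts, vocab = train_class_models(training)
print(classify(["free", "cash"], counts, vocab))
print(classify(["lunch", "meeting"], counts, vocab))
```

The same pattern extends to authorship questions such as the forgery example: train one model per candidate author and compare how well each explains the disputed text.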

ODSC Team


ODSC connects you to the world’s largest community of practicing data scientists, AI experts, and industry executives. Our data science conferences in Boston, San Francisco, London, and other locations attract thousands of data science attendees. Our co-located CxO Summit and AI Expo is a magnet for top executives from some of the world’s largest companies seeking solutions.
