It’s a truism bordering on a cliche to say that every day, there’s more and more data being collected. As a data scientist, this is a great situation in which to find myself: so much data for training machine learning algorithms! But at a certain point, the amount of data out there starts to become overwhelming, and it’s not clear what data will be most useful for solving a given problem. Say I want to build a model of political behavior, like how liberal or conservative someone is–what’s the best dataset to go to when I am building this model? Demographic data? Data from previous elections? Social media? Recent campaign contributions? My goal as a data scientist is to live in a world where, when an interesting new problem walks in the door, I can quickly reach for the datasets that I think will be most useful for solving that problem.

I recently spoke at SciPy about one approach I find interesting for bringing order and scientific methodology to assessing which data is most relevant for a given problem. It’s called Item Response Theory (IRT), and I think it’s one of the best-kept secrets of quantitative social science.

Before we dig into IRT, some background would be helpful. Let me start by challenging the idea that predictive modeling is just about “building a model.” At Civis we have to start upstream of the modeling step, because we’re often solving problems that don’t have ready-made datasets available for training machine learning algorithms–we have to collect, build, aggregate, clean, and assess these datasets ourselves. Only once that hard work is done, and we have the data in hand, does it make sense to start building models. But there are trade-offs. Once those first-pass models are built, we often find that they don’t do as well as we would like, which usually means going back and trying different algorithms or tuning hyperparameters. However, our predictions aren’t just a function of the algorithm, but of the underlying data that was used for training.

Years of experience with building models have taught me two related heuristics about the importance of algorithms vs. data when building a predictive model:
1. A complex, intricate, and carefully crafted algorithm trained with mediocre data will give unimpressive results
2. A simple algorithm trained with a dataset that really captures the relevant patterns can do quite well

With these simple rules in mind, a reasonable response to a poor model might not be to invest lots of time into the algorithm. Instead, we should revisit the training data! How do we really know that the data we’re using captures the trend that we want to extract from it?

Put another way, the question I’m interested in is how we can develop the same intuition for our data that we have for our models. Is there an analysis that I can devise that studies the data itself, and allows me to know the best dataset for training a predictive model?

This brings me to IRT. IRT is a framework that comes originally from education and psychometrics. One well-known example is the SAT exams, where many students take a multiple-choice test which assesses their scholastic ability; another interesting use case comes from political science, where researchers rank-order congressional representatives in terms of partisanship (liberal/conservative) based on how they’ve voted on legislation. These examples, and IRT in general, can be thought of as latent trait models, where there is a characteristic of a person (scholastic ability, partisanship) which might be difficult to measure directly but which we can indirectly access via many measurements (test questions, congressional votes) performed by many examinees (students, representatives).

Here’s the gist of IRT (for much more detail, Baker & Kim is an excellent reference). Say we have a student, with some unknown ability, who takes a test. We already have information on how easy or difficult each test question is, and we anticipate that the student will get easy questions correct and difficult questions wrong. Then we can imagine plotting each answer from the student, where the x coordinate is the question difficulty and the y coordinate is 1 (correct) or 0 (incorrect). If we fit a logistic curve to this data, we’ll find a turn-on where, as a function of question difficulty, the student goes from usually getting questions right to usually getting questions wrong. The location of this turn-on is our estimate of student ability–for smarter students, this turn-on will be far toward the “difficult” end of the spectrum, indicating that a high-ability student mostly gets questions right, even the hard ones. Likewise for low-ability students; they will miss a lot of questions, even easier ones. And of course, there’s a continuum in between those two extremes, where most students will end up.
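To make that turn-on idea concrete, here is a minimal sketch (my own illustration, not from the talk) of estimating a single student’s ability under the one-parameter logistic (Rasch) model, assuming the question difficulties are already known. The function names and toy difficulty values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(ability, difficulty):
    """1PL (Rasch) model: the probability of a correct answer is a
    logistic function of (ability - difficulty)."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def estimate_ability(difficulties, responses):
    """Maximum-likelihood estimate of one student's ability, given
    known question difficulties and 0/1 responses."""
    difficulties = np.asarray(difficulties, dtype=float)
    responses = np.asarray(responses, dtype=float)

    def neg_log_likelihood(theta):
        p = np.clip(p_correct(theta, difficulties), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p)
                       + (1 - responses) * np.log(1 - p))

    result = minimize_scalar(neg_log_likelihood,
                             bounds=(-6, 6), method="bounded")
    return result.x

# A student who gets the easy questions right and the hard ones wrong:
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses    = [1,    1,    1,   0,   0]
theta_hat = estimate_ability(difficulties, responses)
```

The estimated ability lands between the hardest question answered correctly and the easiest one missed–exactly the “turn-on” location described above.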


Here’s the second part of IRT: take that scenario and flip it around, so that we imagine having many students of known ability and we have a new question being added to the test. We want to quantify how easy or difficult the new question is. Again, we can imagine a plot of the resulting data, where we plot correct/incorrect on the y axis as before but now the x axis is the student ability. Similarly, if we fit a logistic curve to this data, there will be a turn-on, and the location of that turn-on now quantifies how difficult the question is. Difficult questions are questions that many students get wrong, even the high-ability ones; medium questions will be answered correctly by high-ability students and incorrectly by low-ability students, and easy questions will be answered correctly by most students of all abilities.


Of course, we are usually not in the luxurious position of quantitatively knowing student ability or question difficulty (or whatever latent traits are of interest), but we usually do have many students taking a test and many questions on the test. That allows us to build a big matrix, where each student is a row and each question is a column, so each entry in the matrix records whether student X gets question Y correct or not. There are several methodologies within the IRT framework that allow us to solve for both student ability and question difficulty at the same time, generally by estimating one, using those preliminary results to estimate the other, then using those estimates to refine the first set of estimates we made, and so on iteratively until the results converge and we have our answers.
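Here is a minimal sketch of that joint estimation for the Rasch model, using simple alternating gradient steps on the log-likelihood. This is an illustration under stated assumptions; production IRT packages typically use marginal maximum likelihood or Bayesian estimation instead, and all names here are my own.

```python
import numpy as np

def fit_rasch(responses, n_iters=50, lr=0.1):
    """Jointly estimate abilities (one per row) and difficulties (one
    per column) from a 0/1 response matrix via gradient ascent on the
    1PL log-likelihood, alternating between the two parameter sets."""
    n_students, n_questions = responses.shape
    abilities = np.zeros(n_students)
    difficulties = np.zeros(n_questions)
    for _ in range(n_iters):
        # Predicted probability each student answers each question right
        p = 1.0 / (1.0 + np.exp(-(abilities[:, None]
                                  - difficulties[None, :])))
        resid = responses - p  # gradient of the log-likelihood
        abilities += lr * resid.sum(axis=1)
        difficulties -= lr * resid.sum(axis=0)
        # The scale is only identified up to a shift, so anchor it by
        # centering the difficulties at zero.
        difficulties -= difficulties.mean()
    return abilities, difficulties

# Toy matrix: rows are students (weak to strong), columns are
# questions (easy to hard).
R = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0]], dtype=float)
abilities, difficulties = fit_rasch(R)
```

On this toy matrix, the fitted abilities come out in increasing order across the three students, and the fitted difficulties in increasing order across the four questions, matching the structure we built in.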

So what does all of this have to do with data quality?

Let’s take our student-test example, but now the student isn’t a person: it’s a dataset. Specifically, it’s the dataset used to train a predictive model. Then that model will be validated by using it to make predictions on our test set: a subset of the data that was NOT used in training, but for which we do have the correct answers. If we keep track of which predictions the model gets wrong, and which ones it gets correct, we’ve basically made a vector of “correct” and “incorrect” predictions.

Now imagine that we can repeat that process many times, with a different type of data used each time. We’re keeping the algorithm the same, as well as the dependent variable; the only thing that changes from one model to another is the training data. Each model will get some predictions correct, and some of them wrong, and we can record this information for each model. With this process we can start to build a matrix of datasets and predictions, where it would not be a huge stretch to compare each training dataset to a student, and each prediction to a test question. Then that matrix can be sent into the formalism of IRT, which will then produce best-fit estimates of prediction difficulty (based on how many and which datasets get a given prediction right/wrong) and dataset predictive ability (based on how many questions, and of what difficulty, are answered correctly by a given dataset).
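As a sketch of what building that matrix might look like in practice–assuming scikit-learn, with synthetic data and arbitrary column slices standing in for the real candidate datasets (demographics, social media, etc.); all names here are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# One outcome, one fixed algorithm, several candidate feature sets.
X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

feature_sets = {          # stand-ins for the real candidate datasets
    "cols_0_3":  slice(0, 4),
    "cols_4_7":  slice(4, 8),
    "cols_8_11": slice(8, 12),
}

rows = []
for name, cols in feature_sets.items():
    model = LogisticRegression(max_iter=1000)  # same algorithm each time
    model.fit(X_train[:, cols], y_train)
    # One row per dataset: 1 where its model predicted correctly, else 0
    correct = (model.predict(X_test[:, cols]) == y_test).astype(int)
    rows.append(correct)

# Datasets-by-predictions response matrix, ready for an IRT fit
response_matrix = np.vstack(rows)
```

Each row plays the role of a student’s answer sheet, and each held-out prediction plays the role of a test question, so `response_matrix` can go straight into an IRT estimation routine.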

These latent trait models are well known in people-based data science, but nothing in the math requires person-level analysis. So I’m suggesting thinking outside the box: use this type of workflow to understand the quality of the data you are working with.

You can find the slides here or watch my full talk below.

Originally posted at