What Donald Trump and Biased Polls Can Teach Us About Data

Pity the pollster.  As the election cycle mercifully nears its inevitable end, cries of bias from the trailing party will grow louder, and a sport played for well over a hundred years, calling statistics lies, reaches fever pitch.

Donald Trump is, of course, correct. Survey polls are biased.  

Bias is certainly nothing new to statisticians, who have been trained to correct for it since Stats 101. In fact, pollsters go to great lengths to explain their polling methodology and the statistical bias in their results.

Despite the fact that a lot of effort and quite a bit of science goes into correcting for bias, charges of poll manipulation in favor of one party over another are frequent. Reputable pollsters, cognisant of this scrutiny, are transparent about the methodology they employ to correct for bias. Statistical bias in polling takes many forms, including response bias, undersampling, and nonresponse bias. More recent issues have arisen around the mix of cell phone and landline users in the sample.

Take a recent ABC/Post poll that gave Hillary Clinton a 12-point lead, 50% to 38%. The fine print of that poll, at the beginning of page 5, identified party divisions at 36% Democrats, 27% Republicans, and 31% Independents. Quite a few people took issue with this poll having too many Democratic respondents, making it unrepresentative of registered voters. Critics called for the pollsters to weight the poll by party affiliation. However, these skeptics miss a key point. Reputable pollsters don’t weight their survey data by party identification, for good reason. Pollsters adjust for demographic items to ensure a survey sample is not under- or over-representing a demographic such as age, race, or location. Demographic items should be verifiable (against census data, for example). Party affiliation, however, is not a demographic but something the poll seeks to measure, and thus should not be weighted.
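To make the idea of demographic weighting concrete, here is a minimal sketch of post-stratification on a single variable. Everything in it is an assumption for illustration: the age groups, the population shares, the respondent counts, and the helper names `poststratify` and `weighted_share` are hypothetical, and real pollsters typically weight on several demographics at once with more elaborate methods.

```python
from collections import Counter

def poststratify(respondents, population_shares, key="age_group"):
    """Return one weight per respondent so that the weighted sample
    matches the given population shares for a single demographic."""
    n = len(respondents)
    counts = Counter(r[key] for r in respondents)
    weights = []
    for r in respondents:
        sample_share = counts[r[key]] / n
        # Under-represented groups get weights above 1, over-represented below 1.
        weights.append(population_shares[r[key]] / sample_share)
    return weights

def weighted_share(respondents, weights, candidate):
    """Weighted fraction of respondents who chose the given candidate."""
    total = sum(weights)
    support = sum(w for r, w in zip(respondents, weights)
                  if r["choice"] == candidate)
    return support / total

# Hypothetical sample of 100: younger voters are 20% of the sample
# but (by assumption) 30% of the population.
respondents = (
    [{"age_group": "18-34", "choice": "A"}] * 12
    + [{"age_group": "18-34", "choice": "B"}] * 8
    + [{"age_group": "35+", "choice": "A"}] * 32
    + [{"age_group": "35+", "choice": "B"}] * 48
)
population_shares = {"18-34": 0.30, "35+": 0.70}  # assumed census-style figures

weights = poststratify(respondents, population_shares)
raw = weighted_share(respondents, [1.0] * len(respondents), "A")
adjusted = weighted_share(respondents, weights, "A")
print(f"raw support for A: {raw:.2f}, weighted: {adjusted:.2f}")
```

Note that the weights are built only from a verifiable demographic. The candidate choice, like party identification, is the thing being measured, so it never enters the weighting step.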

Identifying likely voters is another area of contention. Pollsters rely on registered voter surveys early in the election cycle and typically switch to likely voters around September, when respondents are more likely to know whether they will actually vote and for whom. Pollsters contend that the pool of those who vote is not typically representative of the total eligible population, so there’s a need to determine likely voters. Identifying them is a difficult undertaking for a number of reasons, including respondents who end up not voting and the a priori judgments involved. And so likely-voter screening is another area where critics say polls are skewed.

Response bias is also a well-known phenomenon that reputable pollsters account for with practices like excluding leading questions. One response bias getting extra attention this election season is social desirability bias. We have a tendency to present ourselves in a favorable light, even to perfect strangers cold-calling us, so respondents may tend to give socially desirable answers. Many have argued that Donald Trump’s candidacy has increased response bias in this presidential election.

Consequently, despite reputable pollsters’ best efforts, party affiliation, likely voters, and response bias are just a few of the areas blamed for ‘rigged’ polls. So, what does all this teach you about your data? Well, there are quite a few lessons to be drawn from this.

Primarily, at some point in your career, expect your peers, your boss, or the public to question your data sources and any biases contained therein.

Reputable pollsters have long understood the importance of transparency regarding their data sources, their data collection techniques, and the biases inherent in the models and data they employ. The primary lesson for any data scientist is that reputable pollsters are transparent about their data and apply strict principles of disclosure. Data scientists should consider adhering to standards of data quality to ensure the data collected ‘correctly represents the real-world construct to which it refers’.

The era of big data compounds this problem rather than solves it. More and faster data collection means that pitfalls can occur more frequently. Don’t fall into the trap of believing that big data, or n=all, will reduce your bias problem. Pollsters also know that size isn’t everything. Polls typically rely on 1,000 or fewer respondents. The important rule in sampling is not how many respondents are polled but how pollsters select them, using techniques such as random sampling. Some questions, though not all, are best answered with “small data” before scaling up to big data. Outlier bias is particularly common in big data because the bigger the dataset, the harder it is to find outliers. Correcting these anomalies may make sense in some cases, but not when the outliers are exactly what you are looking for, such as manufacturing defects.
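A small simulation can illustrate why selection matters more than size. The sketch below compares a random sample of 1,000 with a much larger sample in which one side is more willing to respond. The true support level, the response rates, the seed, and the helper names `margin_of_error` and `simulate_poll` are all assumptions chosen for illustration, not figures from any real poll.

```python
import math
import random

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

def simulate_poll(n, true_support, response_rates, rng):
    """Collect n responses and return the observed share of supporters.

    response_rates is (rate for supporters, rate for non-supporters);
    unequal rates model nonresponse bias.
    """
    supporters = 0
    collected = 0
    while collected < n:
        is_supporter = rng.random() < true_support
        rate = response_rates[0] if is_supporter else response_rates[1]
        if rng.random() < rate:  # this person agrees to answer
            collected += 1
            supporters += is_supporter
    return supporters / n

rng = random.Random(2016)  # fixed seed so the run is repeatable
true_support = 0.5         # assumed true level of support

# Everyone equally likely to respond: a small but unbiased sample.
small_random = simulate_poll(1_000, true_support, (1.0, 1.0), rng)
# Supporters respond more often: a large but biased sample.
large_biased = simulate_poll(200_000, true_support, (0.6, 0.4), rng)

print(f"margin of error at n=1,000: +/-{margin_of_error(1_000):.3f}")
print(f"small random sample estimate: {small_random:.3f}")
print(f"large biased sample estimate: {large_biased:.3f}")
```

Under these assumptions the small random sample should land within a few points of the true value, while the large sample settles very precisely on the wrong answer (around 0.6). Adding more respondents shrinks the noise but does nothing about the bias.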

Another lesson is that new sources of data will create new biases. A January 2016 Federal Trade Commission report highlighted how the era of big data can lead to bias against certain demographics, such as low-income and underserved populations, due to their inclusion in or exclusion from large data sets. This suggests that data scientists need to account for bias in their models. The report notes, “If the process that generated the underlying data reflects biases in favor of or against certain types of individuals, then some statistical relationships revealed by that data could perpetuate those biases.”

Social media may seem like a panacea for our unwillingness to take part in interview-based polls, but new data sources contain new forms of bias that need to be, at the very least, explained, and ideally quantified and corrected for. Take unstructured data and, specifically, social media. Though sentiment analysis using Twitter continues to improve, social media has inherent bias issues. Millennial bias (a skew toward a younger audience), influencer bias, and access bias are but a few.

Data scientists are often accused of being enamored with models and not heeding data quality and data bias issues. Polls are a good example of the scrutiny and criticism data sets are subject to. As data science permeates more aspects of our lives, expect, as I said, the public to rightly question your data sources and quality. When a Donald Trump questions your data and your methods, how well prepared are you to answer?

As you advance in your career you will put greater emphasis on connecting with fellow data scientists who are in the trenches and can guide you on the coding languages, tools, and practices they find useful. Applied data science conferences such as ODSC West are an excellent way to accomplish and accelerate this goal. ODSC events give you the opportunity to connect with your peers and learn the latest languages, tools, and topics associated with programming for data science. You also get to hear and learn from some of the top coders who brought you your favorite open source tools and libraries.

©ODSC 2016

Sheamus McGovern

Founder of ODSC and Software Architect specializing in complex multi-platform systems across multiple industries including finance, healthcare, and education.