The confidence interval is a basic statistical concept that data scientists use all the time. Without a formal background in statistics, however, some data scientists are unsure what is really going on behind the notion. In this article, we’ll review the basics of confidence intervals for data scientists (without the mathematics) and show a simple example with linear regression.
What if you’d like to know what percentage of people in the U.S. are night owls (people who stay up late at night)? To obtain an exact answer, you’d have to ask every person in the country this question, but polling over 300 million people isn’t practical.
One alternative is to take a much smaller random sample of people and find the percentage of night owls in that sample. The problem is that we can’t be sure this percentage is correct, or how far it is from the true value for the entire population. So we’ll look for an “interval” that supports a statement like “I am 95% confident that the percentage of people in the U.S. who are night owls is between 12% and 16%.” This statement is based on what’s called a “confidence interval,” in this case 14% +/- 2 percentage points, with a confidence level of 95%.
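As a rough illustration of where an interval like this can come from, the sketch below computes a 95% interval for a proportion using the common normal approximation. The sample size and count are hypothetical, chosen only so the result lands near the 12% to 16% range quoted above; they are not from any real poll.

```r
# Hypothetical poll results (made-up numbers for illustration only)
n <- 1156           # people sampled
night_owls <- 162   # respondents who say they stay up late

p_hat <- night_owls / n    # sample proportion, about 0.14
z <- qnorm(0.975)          # critical value for 95% confidence, about 1.96

# Margin of error under the normal approximation to the binomial
margin <- z * sqrt(p_hat * (1 - p_hat) / n)

ci <- c(lower = p_hat - margin, estimate = p_hat, upper = p_hat + margin)
round(100 * ci, 1)         # interval expressed in percent
```

With these assumed numbers the margin of error comes out at about 2 percentage points. A smaller sample would widen the interval and a larger one would narrow it.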
When a pollster reports an estimate with a margin of error, they are typically reporting a 95% confidence interval. Confidence intervals are thus a way of quantifying the uncertainty of an estimate. More precisely, if we took many different random samples and computed a 95% confidence interval from each one, about 95% of those intervals would contain the true population value.
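The repeated-sampling idea can be checked with a small simulation. The sketch below assumes a made-up true night-owl rate of 14% and simulates many polls of 1,000 people each; the names `true_p`, `n`, and `n_polls` are illustrative, and the interval uses the same normal approximation a pollster might apply.

```r
set.seed(42)        # make the simulation reproducible

true_p  <- 0.14     # assumed "true" population proportion (hypothetical)
n       <- 1000     # people per simulated poll
n_polls <- 2000     # number of simulated polls

# For each simulated poll, build a 95% interval and record whether
# it contains the true proportion
covered <- replicate(n_polls, {
  p_hat  <- rbinom(1, size = n, prob = true_p) / n
  margin <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
  (p_hat - margin <= true_p) && (true_p <= p_hat + margin)
})

mean(covered)       # share of intervals that contain true_p
```

The share of intervals that capture the true value should come out close to 0.95. Any single interval either contains the true value or it doesn’t; the 95% describes how often the procedure succeeds over many samples.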
To demonstrate confidence intervals, we’ll use survey, a well-known R data set of student survey responses from the MASS package. The data set has 237 observations and 12 variables. Our example will use two variables: Height – the height of the student in centimeters, and Wr.Hnd – the span of the writing hand in centimeters. We’ll use R’s lm() function to fit a simple linear regression model as shown below:
# Load the MASS package, which provides the survey data set
library(MASS)
# Fit linear model: response variable Height,
# predictor Wr.Hnd
lm1 <- lm(Height ~ Wr.Hnd, data = survey)
# Show computed linear model components
# Slope: 3.117, intercept: 113.954
lm1
# Call:
# lm(formula = Height ~ Wr.Hnd, data = survey)
#
# Coefficients:
# (Intercept)       Wr.Hnd
#     113.954        3.117
You can easily derive confidence intervals for the coefficients of this model using the confint() function in R. You pass this function the linear model object as the first argument, along with the desired confidence level through the named level argument (for example, level = 0.95). The results in the sample code indicate that you can be 95% confident that the slope, in our case the coefficient of the Wr.Hnd predictor, is between 2.55 and 3.69. Commonly used confidence levels are 90%, 95%, and 99%.
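# Use confint() with current model and desired level of
# confidence.
# 95% confidence intervals
confint(lm1, level = 0.95)
#                  2.5 %     97.5 %
# (Intercept) 103.225178 124.682069
# Wr.Hnd        2.547273   3.685961
# 90% confidence intervals
confint(lm1, level = 0.90)
#                    5 %       95 %
# (Intercept) 104.962490 122.944757
# Wr.Hnd        2.639469   3.593764
# 99% confidence intervals
confint(lm1, level = 0.99)
#                  0.5 %     99.5 %
# (Intercept)  99.805876 128.101371
# Wr.Hnd        2.365815   3.867418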
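To see roughly what confint() does for a linear model, the 95% interval for the slope can be rebuilt by hand as the estimate plus or minus a t critical value times the standard error. This is a minimal sketch that refits the same model; the variable names `coefs`, `est`, `se`, `t_crit`, and `manual_ci` are illustrative.

```r
library(MASS)   # provides the survey data set

lm1 <- lm(Height ~ Wr.Hnd, data = survey)

# Coefficient table: one row per term, with estimate and standard error
coefs <- summary(lm1)$coefficients
est <- coefs["Wr.Hnd", "Estimate"]
se  <- coefs["Wr.Hnd", "Std. Error"]

# Critical value from the t distribution with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(lm1))

# Estimate plus or minus the margin of error
manual_ci <- est + c(-1, 1) * t_crit * se
manual_ci

# Should agree with the built-in result for the same term
confint(lm1, "Wr.Hnd", level = 0.95)
```

Changing 0.975 to 0.95 or 0.995 corresponds to the 90% and 99% intervals, which is why higher confidence levels produce wider intervals.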
This example is very simple, but it’s important to remember that a good portion of a data scientist’s work is just simple math. Understanding the fundamentals is essential if you want to interpret data effectively. It will also help you do a better job at data storytelling. As a data scientist, a major part of your job is to clearly communicate statistical concepts to people with various levels of statistical knowledge.