Confidence Intervals for Data Scientists

Confidence intervals are a basic statistical concept that data scientists use routinely. Without a formal background in statistics, however, some data scientists are unsure what a confidence interval actually means. In this article, we’ll review the basics of confidence intervals for data scientists (without the mathematics) and show a simple example with linear regression.

Suppose you’d like to know what percentage of people in the U.S. are night owls (people who stay up late at night). To obtain an exact answer, you’d have to ask every person in the country, but polling over 300 million people isn’t practical.


One alternative is to draw a much smaller random sample of people and find the percentage of night owls in that sample. The problem is that we can’t be sure this percentage is correct, or how far it is from the true value for the entire population. So we look for an “interval” that supports a statement such as “I am 95% confident that the percentage of people in the U.S. who are night owls is between 12% and 16%.” This statement is based on what’s called a “confidence interval”: in this case 14% +/- 2 percentage points, with a confidence level of 95%.
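As a rough illustration of where such an interval could come from, the sketch below computes a normal-approximation (Wald) interval for a sample proportion. The sample size and the count of night owls are hypothetical numbers chosen only so that the result lands near the 12% to 16% range in the text; they are not survey results.

```r
# Hypothetical poll: these two numbers are made up for illustration.
n          <- 1156   # people sampled
night_owls <- 162    # respondents who say they are night owls

p_hat <- night_owls / n                      # sample proportion
z     <- qnorm(0.975)                        # critical value for 95% confidence
moe   <- z * sqrt(p_hat * (1 - p_hat) / n)   # margin of error

ci <- c(lower = p_hat - moe, upper = p_hat + moe)
print(round(100 * ci, 1))                    # interval in percent
```

With these made-up inputs the interval should come out at roughly 12% to 16%. For small samples or proportions near 0 or 1, other interval methods (for example the one used by R’s prop.test()) are usually preferred over this simple approximation.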

When a pollster reports an estimate and a margin of error, they are in effect reporting a confidence interval, most often at the 95% level. Confidence intervals are thus a way of quantifying the uncertainty of an estimate. More precisely, if we take many different random samples and compute a 95% confidence interval from each one, about 95% of those intervals will contain the true population value.
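The repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a population in which exactly 14% of people are night owls (a hypothetical value), draws many random samples, builds a 95% normal-approximation interval from each, and counts how often the interval contains the assumed true value. The seed, sample size, and number of repetitions are arbitrary choices.

```r
set.seed(42)       # arbitrary seed, for reproducibility only
true_p <- 0.14     # assumed population proportion (hypothetical)
n      <- 1000     # size of each random sample
reps   <- 2000     # number of repeated samples

covers <- replicate(reps, {
  x     <- rbinom(1, size = n, prob = true_p)   # night owls in one sample
  p_hat <- x / n
  moe   <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
  (p_hat - moe <= true_p) && (true_p <= p_hat + moe)
})

coverage <- mean(covers)   # share of intervals containing true_p
print(coverage)
```

The printed coverage should be close to 0.95, though it will vary a little with the seed and the settings.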

To demonstrate confidence intervals, we’ll use the well-known student survey R data set called survey from the MASS package. The data set has 237 observations and 12 variables. Our example will use two variables: Height – the height of the student in centimeters, and Wr.Hnd – the span of the writing hand in centimeters. We’ll use R’s lm() function to fit a simple linear regression model as shown below:

 

library("MASS")

# Fit linear model: response variable Height,
# predictor Wr.Hnd
lm1 <- lm(Height ~ Wr.Hnd, data = survey)

# Show computed linear model components
lm1
# Slope: 3.117, intercept: 113.954
#
# Call:
# lm(formula = Height ~ Wr.Hnd, data = survey)
#
# Coefficients:
# (Intercept)       Wr.Hnd
#     113.954        3.117

 

You can easily obtain confidence intervals for this model’s coefficients using the confint() function in R. You pass the function the linear model object as the first argument, along with the desired confidence level through the level argument. The results in the sample code indicate that you can be 95% confident that the slope parameter, in our case the coefficient of the Wr.Hnd predictor, is between 2.55 and 3.69. Commonly used confidence levels are 90%, 95%, and 99%; note that the interval gets wider as the confidence level increases.

 

# Use confint() with the current model and the desired
# level of confidence.
confint(lm1, level = 0.95)
#                  2.5 %     97.5 %
# (Intercept) 103.225178 124.682069
# Wr.Hnd        2.547273   3.685961

confint(lm1, level = 0.90)
#                    5 %       95 %
# (Intercept) 104.962490 122.944757
# Wr.Hnd        2.639469   3.593764

confint(lm1, level = 0.99)
#                 0.5 %     99.5 %
# (Intercept) 99.805876 128.101371
# Wr.Hnd       2.365815   3.867418
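To see what confint() does for a linear model, the sketch below rebuilds the 95% intervals by hand as estimate +/- critical value x standard error, using a t critical value with the model’s residual degrees of freedom. The variable names (fit, est, se, tcrit, manual) are introduced here for illustration and do not appear in the example above.

```r
library(MASS)   # provides the survey data set

fit <- lm(Height ~ Wr.Hnd, data = survey)

est   <- coef(fit)                # point estimates (intercept and slope)
se    <- sqrt(diag(vcov(fit)))    # standard errors of the estimates
level <- 0.95
tcrit <- qt(1 - (1 - level) / 2, df = df.residual(fit))   # t critical value

# Interval = estimate +/- critical value * standard error
manual <- cbind(lower = est - tcrit * se,
                upper = est + tcrit * se)

print(manual)
print(confint(fit, level = level))   # should agree with the manual version
```

Changing level to 0.90 or 0.99 changes only the critical value, which is why the intervals narrow or widen around the same point estimates.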

 

Conclusion

This example is very simple, but it’s important to remember that a good portion of a data scientist’s work is just simple math. Understanding the fundamentals is essential if you want to interpret data effectively. It will also help you do a better job at data storytelling. As a data scientist, a major part of your job is to clearly communicate statistical concepts to people with various levels of statistical knowledge.

Daniel Gutierrez, ODSC


Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.
