Exploring the Central Limit Theorem in R

The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It’s certainly a concept that every data scientist should fully understand. In this article, we’ll go over some basic theory of the CLT, explain why it’s important for data scientists, and present some R code that explores the theorem’s characteristics.

CLT Theory

The CLT is often confused with “the law of large numbers.” That law states that as the size of a sample increases, the sample mean becomes a more accurate estimate of the population mean. The main difference between the two theorems is that the law of large numbers concerns where the mean of a single, growing sample ends up, whereas the CLT describes the shape of the distribution of sample means across many samples.
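To make the law of large numbers concrete, here is a minimal sketch that follows one growing sample and tracks how far its running mean is from the population mean. It assumes a Binomial(4, 0.05) population, matching the simulation later in this article; the seed and variable names are illustrative choices.

```r
set.seed(42)                    # arbitrary seed, for reproducibility only

n <- 4                          # trials per binomial draw
p <- 0.05                       # success probability per trial
pop_mean <- n * p               # population mean of Binomial(n, p)

# One growing sample: the running mean after 1, 2, ..., 100000 draws
draws <- rbinom(100000, size = n, prob = p)
running_mean <- cumsum(draws) / seq_along(draws)

# Absolute error of the sample mean at a few sample sizes
sizes <- c(10, 100, 1000, 10000, 100000)
lln_error <- abs(running_mean[sizes] - pop_mean)
print(data.frame(sample_size = sizes, abs_error = lln_error))
```

The error should generally shrink as the sample grows, though not necessarily at every step. Note that nothing here says anything about the shape of a distribution; that is the CLT's job.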

The CLT concerns samples of a given size drawn repeatedly from the same population. The average of the resulting sample means will be approximately equal to the mean of the original population. More importantly, as the sample size increases, the distribution of those sample means approaches a normal distribution (aka Gaussian distribution), whatever the shape of the population distribution, as long as the population has a finite mean and variance. Drawing more samples does not change this distribution; it only lets us see it more clearly. This distribution of sample means is referred to as the “sampling distribution.”

In other words, the CLT states that the sampling distribution of the sample mean approximates a normal distribution. It does so regardless of the distribution of the sampled population, provided the sample size is sufficiently large. This enables data scientists to make statistical inferences about the population based on normal distribution properties, even if the sample is drawn from a population that is not normally distributed.
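Stated a little more formally, this is the classical version of the theorem for independent, identically distributed draws. The notation is introduced here for illustration: X_1, ..., X_n are the draws, the population has mean \mu and finite variance \sigma^2 > 0, and \bar{X}_n is the sample mean.

```latex
\[
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
\]
```

Equivalently, for large n the sample mean is approximately normal with mean \mu and variance \sigma^2 / n.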

 

CLT for Data Scientists

There are lists floating around like “What are the top questions to detect a fake data scientist?” One particular question on such lists caught my eye: “What is the Central Limit Theorem, and why is it important?” After considering it, I doubt many data scientists can answer the question properly. But they’re likely using the underlying concept on a regular basis.

So why is the CLT important? Because it’s at the core of what every data scientist does — make statistical inferences about data.

If we can claim that a sample mean is approximately normally distributed, we can attach probabilities to how far it is likely to fall from the population mean. In data science, we often want to compare two different populations through statistical significance tests, i.e. hypothesis testing. Using the CLT and knowledge of the Gaussian distribution, we’re able to assess our hypothesis about the two populations.
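As a rough sketch of what such a comparison can look like, the code below runs a large-sample two-sided z-test on the difference between two sample means. Everything in it is an illustrative assumption: the data are simulated from two skewed (exponential) populations, and the group names, sample sizes, and means are made up for the example.

```r
set.seed(1)

# Two skewed (exponential) populations with different means; simulated data
group_a <- rexp(500, rate = 1 / 10)   # population mean 10
group_b <- rexp(500, rate = 1 / 12)   # population mean 12

# Difference in sample means and its estimated standard error
diff_means <- mean(group_b) - mean(group_a)
std_error  <- sqrt(var(group_a) / length(group_a) +
                   var(group_b) / length(group_b))

# By the CLT, the standardized difference is approximately N(0, 1)
# under the null hypothesis of equal population means
z_stat  <- diff_means / std_error
p_value <- 2 * pnorm(-abs(z_stat))    # two-sided p-value

cat("z =", round(z_stat, 2), " p-value =", signif(p_value, 3), "\n")
```

Neither population is normal, yet the normal reference distribution is reasonable here because each sample mean is based on several hundred observations.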

In addition, regularly used statistical techniques such as confidence intervals and hypothesis tests are based on the CLT. There are some limitations, however. You can’t rely on the CLT when sampling isn’t random, or when the underlying distribution doesn’t have a finite mean and variance (the Cauchy distribution is a standard example).
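The finite-variance limitation is easy to see in a simulation. The sketch below compares the spread of sample means for a well-behaved population (exponential) with that of a population that has no finite mean or variance (standard Cauchy). The choice of distributions, sample sizes, and the helper `spread_of_means` are assumptions made for this illustration.

```r
set.seed(7)

s <- 2000                        # number of simulated samples per size
m <- c(20, 1000)                 # sample sizes to compare

# Interquartile range of s sample means, for a given sampler and sample size
spread_of_means <- function(sampler, sample_size) {
  IQR(replicate(s, mean(sampler(sample_size))))
}

# Exponential(1): finite mean and variance, so the CLT applies
iqr_exp <- sapply(m, function(k) spread_of_means(rexp, k))

# Standard Cauchy: no finite mean or variance, so the CLT does not apply
iqr_cauchy <- sapply(m, function(k) spread_of_means(rcauchy, k))

print(rbind(exponential = iqr_exp, cauchy = iqr_cauchy))
```

For the exponential population the sample means should cluster much more tightly at the larger sample size. For the Cauchy population the spread should stay roughly the same, because the mean of Cauchy draws is itself Cauchy distributed no matter how many draws are averaged.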

As a data scientist, you should be able to explain this theorem and understand why it’s so important. To deepen this understanding, I suggest you study the mathematical foundation of the CLT. Also, check out the Khan Academy instructional video on the CLT.

 

Using R to Explore the CLT

Now let’s illustrate the CLT. We simulate an experiment 2,000 times, taking random draws from a binomial distribution with four trials and a 0.05 probability of success. We use four sample sizes: 20, 100, 500, and 1,000. For each sample, we calculate the Z-score of the sample mean, i.e. a measure of how many standard errors the sample mean lies below or above the population mean.
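Written out, the Z-score computed in the code uses the mean and variance of a binomial distribution. Here n is the number of trials per draw, p is the probability of success, m is the sample size, and \bar{x} is the mean of one sample of m draws.

```latex
\[
E[X] = np, \qquad \operatorname{Var}(X) = np(1-p), \qquad
Z = \frac{\bar{x} - np}{\sqrt{np(1-p)/m}} .
\]
```

The denominator is the standard error of the sample mean, which is why it shrinks as the sample size m grows.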

n <- 4     # Number of trials per binomial draw
p <- 0.05  # Probability of success on each trial
s <- 2000  # Number of simulations
m <- c(20, 100, 500, 1000)  # Sample sizes

EX <- n*p            # Population mean of Binomial(n, p)
VarX <- n*p*(1-p)    # Population variance of Binomial(n, p)

Z_score <- matrix(NA, nrow = s, ncol = length(m))
for (i in 1:s){
  for (j in seq_along(m)){ # loop over sample sizes
    samp <- rbinom(n = m[j], size = n, prob = p)
    sample_mean <- mean(samp) # sample mean
    # Calculate Z score for the mean of each sample
    Z_score[i,j] <- (sample_mean-EX)/sqrt(VarX/m[j])
  }
}

Now let’s plot a series of four stacked histograms of the Z-score — one for each sample size — and add the density curve from the normal distribution to each histogram.

# Display distribution of the Z-scores of the sample means
par(mfrow=c(4,1))
for (j in seq_along(m)){
  hist(Z_score[,j], xlim=c(-5,5),
       freq=FALSE, ylim=c(0, 0.5),
       ylab="Density", xlab="",
       main=paste("Sample Size =", m[j]))
  # Standard normal density curve
  x <- seq(-4, 4, by=0.01)
  y <- dnorm(x)
  lines(x, y, col="blue")
}
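The histograms give a visual impression; a numerical summary can back it up. The sketch below reruns the same kind of simulation in a self-contained form and reports the mean, standard deviation, and skewness of the Z-scores for each sample size. The helper functions `simulate_z` and `skewness` and the seed are assumptions added for this illustration, not part of the code above.

```r
set.seed(123)

n <- 4                          # trials per binomial draw
p <- 0.05                       # probability of success on each trial
s <- 2000                       # number of simulations
m <- c(20, 100, 500, 1000)      # sample sizes

EX   <- n * p                   # population mean
VarX <- n * p * (1 - p)         # population variance

# Z-scores of s sample means for one sample size
simulate_z <- function(sample_size) {
  sample_means <- replicate(s, mean(rbinom(sample_size, size = n, prob = p)))
  (sample_means - EX) / sqrt(VarX / sample_size)
}

# Sample skewness: close to 0 for a symmetric distribution
skewness <- function(z) mean((z - mean(z))^3) / sd(z)^3

z_list <- lapply(m, simulate_z)
summary_table <- data.frame(
  sample_size = m,
  mean_z      = sapply(z_list, mean),
  sd_z        = sapply(z_list, sd),
  skewness    = sapply(z_list, skewness)
)
print(round(summary_table, 3))
```

If the CLT is doing its work, the means should sit near 0 and the standard deviations near 1 for every sample size, while the skewness should move toward 0 as the sample size grows. That shrinking skewness is the numerical counterpart of the histograms becoming more symmetric.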


Daniel Gutierrez

Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.

Open Data Science - Your News Source for AI, Machine Learning & more