A Quick Look Into Bootstrapping

Executive Summary
  • As a resampling method, bootstrapping allows us to generate statistical inferences about the population from a single sample.
  • Learn to bootstrap in R.
  • Bootstrapping lays the foundation for several machine learning methods (e.g., bagging, which I'll explain in a follow-up post).

[Related Article: Discovering 135 Nights of Sleep with Data, Anomaly Detection, and Time Series]


Big Questions:

  • After an A/B test, to what extent can we trust that our small sample represents the entire population of our customers?
  • If we repeatedly drew samples of the same size, how much would our estimates vary?
  • If we obtain different estimates after repeated sampling, can we gauge the distribution of the population?
  • If we don’t know the distribution of our variables, what solutions do we have?

What is bootstrapping?

Bootstrapping is a resampling method that allows us to gauge the distribution of the population from one sample distribution. We can estimate the population variance from a single sample in the following steps:

  1. Draw N data points from the sample with replacement; the same observation can be resampled multiple times.
  2. Refit the statistic or model of interest to each resampled (bootstrapped) dataset.
  3. Calculate the variance across the bootstrap estimates.
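The three steps above can be sketched in a few lines of R. This is a minimal illustration, not the example used later in the post: the data `x` and the replicate count `B` are made up here, and we bootstrap the standard error of the sample mean.

```r
# A minimal sketch of the three steps: bootstrap the standard
# error of the sample mean from one observed sample
set.seed(42)
x <- rnorm(30, mean = 5, sd = 2)   # one observed sample of size 30
B <- 2000                          # number of bootstrap replicates

boot_means <- sapply(1:B, function(i) {
  resample <- sample(x, replace = TRUE)  # step 1: draw n points with replacement
  mean(resample)                         # step 2: recompute the statistic
})

sd(boot_means)                     # step 3: spread of the bootstrap estimates
```

The result should land near the textbook standard error of the mean, sd(x)/sqrt(30).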

Why bootstrap?

As data scientists, we have to make statistical inferences about the population distribution from a small sample.

For example, we conduct an A/B test, collect a sample of 100 customers, and find that Version A generates more website traffic. The question is, can we conclude that all customers will find Version A more appealing?

It is possible that what works for the customers in the sample may not work for the customers in the population.

This is a critical question because it’s not feasible to survey the entire population for our research questions.

To derive valid statistical inferences, we can rely on bootstrapping. For various reasons, a point estimate from a single sample may come with a large standard deviation, which makes the estimator unreliable. We can assess its accuracy by estimating the standard deviation of the estimator.

As a nonparametric method, bootstrapping comes in handy and allows us to estimate the uncertainty of an estimator.

How to bootstrap in R?

Hypothetically, we flip a coin with two outcomes: heads and tails. There is a 60% chance of getting heads on each flip. After 50 flips, we obtain the following binomial distribution.

# Create a binomial distribution
# (you may get slightly different results without a seed)
n <- 50
coin_flips <- rbinom(n, 1, p = 0.6)
phat <- mean(coin_flips)
sd_hat <- sqrt(phat * (1 - phat) / n)
print(sprintf("Mean = %f, SD = %f", phat, sd_hat))
[1] "Mean = 0.600000, SD = 0.069282"

Following the classical approach, we calculate the mean and variance using a binomial distribution. The mean is 0.6 and the standard error is 0.069.

Now, let’s create a bootstrapped data and compare the results of these two methods.

# Bootstrap 1000 times
B <- 1000
bootstrap_samples <- sapply(1:B, function(i) mean(coin_flips[sample(n, replace = TRUE)]))

# Plot the bootstrapped estimates
hist(bootstrap_samples, freq = FALSE, breaks = 20, main = "Bootstrap estimates of phat")
curve(dnorm(x, phat, sd_hat), add = TRUE, col = "red", lwd = 2)
abline(v = 0.6, col = "black", lwd = 4)

Let’s play with the bootstrapped data a little bit.

As explained above, the same observation can be sampled repeatedly. So how many observations end up repeated, and how many are left out of the bootstrap sample entirely?

set.seed(1)
n <- 1000
included_obs <- length(unique(sample(1:n, replace = TRUE)))
included_obs
[1] 639
missing_obs <- n - included_obs
missing_obs
[1] 361
missing_obs / n
[1] 0.361

As can be seen, out of 1,000 observations, 639 are unique and 361 (or 36.1%) are missing from the bootstrap sample.
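That ~36% figure is no accident. The chance that a given observation is never drawn in n resamples is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows, so roughly a third of the data is always left out of any one bootstrap sample. A quick check in R:

```r
# Probability a given observation is excluded from a bootstrap sample:
# each draw misses it with probability (1 - 1/n), across n independent draws
n <- 1000
p_missing <- (1 - 1/n)^n
p_missing   # ≈ 0.368
exp(-1)     # the limiting value, ≈ 0.368
```

This matches the 36.1% observed in the simulated draw above.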

How about the confidence interval?

set.seed(1)
n <- 101
RC_shots <- c(rep(1, 50), rep(0, 51))
bootstrap_samples <- sapply(1:1000, function(i) mean(RC_shots[sample(n, replace = TRUE)]))
hist(bootstrap_samples, freq = FALSE, breaks = 20, main = "Bootstrap Estimates of Sample Mean")
quantile(bootstrap_samples, c(.025, .975))  # 95% C.I. end points
     2.5%     97.5% 
0.4059406 0.5940594

The 95% bootstrap confidence interval is [0.4059406, 0.5940594].

[Related Article: 3 Common Regression Pitfalls in Business Applications]


Happy reading and learning!

Originally Posted Here

Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master's degree in Statistics at the UC, Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include:

1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data
2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control
3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment
4. Causal Graphical Model, User Engagement, Optimization, and Data Visualization
5. Python, R, and SQL

Connect here:

1. http://www.linkedin.com/in/leihuaye
2. https://twitter.com/leihua_ye
3. https://medium.com/@leihua_ye
