# A Quick Look Into Bootstrapping

Posted by Leihua Ye, December 3, 2019

Executive Summary

• As a resampling method, bootstrapping allows us to make statistical inferences about the population from a single sample.
• Learn to bootstrap in R.
• Bootstrapping lays the foundation for several machine learning methods (e.g., Bagging, which I’ll explain in a follow-up post).


Big Questions:

• After an A/B test, to what extent can we trust that our small sample represents the entire population of our customers?
• If we repeatedly draw samples of the same size, how would our estimates vary?
• If we obtain different estimates after repeated sampling, can we gauge the distribution of the population?
• If we don’t know the distribution of our variables, what solutions do we have?

### What is bootstrapping?

Bootstrapping is a resampling method that allows us to gauge the distribution of the population from a single sample. We can estimate the variance of an estimator from that one sample in the following steps:

1. Draw N data points from the sample with replacement; the same observation can be resampled multiple times.
2. Refit the statistical model (or recompute the statistic) on the resampled data.
3. Repeat steps 1 and 2 many times and calculate the sample variance of the resulting estimates.
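The steps above can be sketched in a few lines of R. This is only a minimal illustration under stated assumptions: the data vector `x` (100 draws from an exponential distribution), the choice of the median as the statistic, and the number of resamples `B` are all hypothetical and not taken from the original post.

```r
# Hypothetical sample: 100 draws from an exponential distribution (assumed for illustration)
set.seed(42)
x <- rexp(100, rate = 1)
B <- 2000  # number of bootstrap resamples (an arbitrary choice)

# Steps 1 and 2: resample N points with replacement, then recompute the statistic
boot_medians <- replicate(B, median(sample(x, size = length(x), replace = TRUE)))

# Step 3: the variance of the bootstrap estimates approximates the variance of the estimator
boot_var <- var(boot_medians)
boot_se <- sqrt(boot_var)
print(c(variance = boot_var, std_error = boot_se))
```

The median is used here because it has no simple closed-form standard error, which is the kind of situation where resampling is most useful.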

### Why bootstrap?

As data scientists, we have to make statistical inferences about the population distribution from a small sample.

For example, we run an A/B test, collect a sample of 100 customers, and find that Version A generates more website traffic. The question is, can we conclude that all customers will find Version A more appealing?

It is possible that what works for the customers in the sample may not work for the customers in the population.

This is a critical question because it’s not feasible to survey the entire population for our research questions.

To derive valid statistical inferences, we can rely on bootstrapping. Any point estimate computed from a sample varies from one sample to the next, and with a small sample that variation can be large, so a single estimate may sit far from the true value. To judge how much to trust the estimate, we need the standard deviation of the estimator (its standard error).

As a nonparametric method, bootstrapping comes in handy and allows us to estimate the uncertainty of an estimator.

### How to bootstrap in R?

Hypothetically, we flip a coin with two outcomes: heads and tails. There is a 60% chance of getting heads on each flip. After 50 flips, we obtain the following sample from a binomial distribution.

```
# Draw 50 flips from a binomial distribution (1 = heads)
# You may get slightly different results
n <- 50
coin_flips <- rbinom(n, 1, p = 0.6)
phat <- mean(coin_flips)
# Classical standard error of a sample proportion
sd_hat <- sqrt(phat * (1 - phat) / n)
print(sprintf("Mean = %f, SD = %f", phat, sd_hat))
# [1] "Mean = 0.600000, SD = 0.069282"
```

Following the classical approach, we calculate the mean and its standard error from the binomial formula. The mean is 0.6 and the standard error is 0.069.

Now, let’s create bootstrapped data and compare the results of these two methods.

```
# Bootstrap 1000 times
B <- 1000
bootstrap_samples <- sapply(1:B, function(i) mean(coin_flips[sample(n, replace = TRUE)]))
# Plot the bootstrapped estimator
hist(bootstrap_samples, freq = FALSE, breaks = 20, main = "Bootstrap estimates of phat")
# Overlay the classical normal approximation and mark the true probability of heads
curve(dnorm(x, phat, sd_hat), add = TRUE, col = "red", lwd = 2)
abline(v = 0.6, col = "black", lwd = 4)
```
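The histogram compares the two methods visually; the comparison can also be made numerically. The sketch below repeats the coin-flip setup so that it runs on its own and prints the classical standard error next to the standard deviation of the bootstrap estimates. The seed and the name `boot_se` are assumptions added for illustration, so the exact numbers will differ from those in the post.

```r
set.seed(123)  # arbitrary seed, assumed here only for reproducibility
n <- 50
coin_flips <- rbinom(n, 1, p = 0.6)
phat <- mean(coin_flips)
sd_hat <- sqrt(phat * (1 - phat) / n)  # classical standard error

B <- 1000
bootstrap_samples <- sapply(1:B, function(i) mean(coin_flips[sample(n, replace = TRUE)]))
boot_se <- sd(bootstrap_samples)       # bootstrap standard error

print(sprintf("Classical SE = %f, bootstrap SE = %f", sd_hat, boot_se))
```

The two values should be close, since for a sample mean the bootstrap is approximating the same quantity that the binomial formula gives directly.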

Let’s play with the bootstrapped data a little bit.

As explained above, the same observation can be drawn more than once. So how many distinct observations end up in a bootstrap sample, and how many are left out?

```
set.seed(1)
n <- 1000
# Count the distinct observations that appear in one bootstrap sample
included_obs <- length(unique(sample(1:n, replace = TRUE)))
included_obs
# [1] 639
missing_obs <- n - included_obs
missing_obs
# [1] 361
missing_obs / n
# [1] 0.361
```

As can be seen, of the 1000 original observations, 639 appear at least once in the bootstrap sample and 361 (or 36.1%) are missing from it.
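A share of roughly 36% is not a coincidence. Each observation is skipped in a single draw with probability 1 - 1/n, so it is skipped in all n draws with probability (1 - 1/n)^n, which approaches exp(-1), about 0.368, as n grows. The sketch below checks this; the 200 repetitions and the variable names are assumptions made for illustration.

```r
n <- 1000
# Probability that a given observation is never drawn in n draws with replacement
p_missing <- (1 - 1 / n)^n
print(c(exact = p_missing, limit = exp(-1)))

# Simulation check: average share of missing observations over 200 bootstrap samples
set.seed(1)
missing_frac <- replicate(200, 1 - length(unique(sample(n, replace = TRUE))) / n)
print(mean(missing_frac))
```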

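Bootstrap samples can also be used to build a confidence interval: we take the 2.5% and 97.5% quantiles of the bootstrapped estimates.

```
set.seed(1)
B <- 1000
RC_shots <- c(rep(1, 50), rep(0, 51))
n <- length(RC_shots)
bootstrap_samples <- sapply(1:B, function(i) mean(RC_shots[sample(n, replace = TRUE)]))
hist(bootstrap_samples, freq = FALSE, breaks = 20, main = "Bootstrap Estimates of Sample Mean")
quantile(bootstrap_samples, c(.025, .975))  # 95% C.I. end points
#      2.5%     97.5%
# 0.4059406 0.5940594
```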

The 95% bootstrap confidence interval is [0.4059406, 0.5940594].
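As a rough sanity check, the percentile interval can be set beside the classical normal-approximation interval for a proportion. This is a sketch rather than part of the original analysis: the names `wald_ci` and `boot_ci` are hypothetical, and the bootstrap endpoints depend on the seed and the R version.

```r
set.seed(1)
RC_shots <- c(rep(1, 50), rep(0, 51))
n <- length(RC_shots)
phat <- mean(RC_shots)

# Normal-approximation interval: phat +/- z * sqrt(phat * (1 - phat) / n)
se <- sqrt(phat * (1 - phat) / n)
wald_ci <- phat + c(-1, 1) * qnorm(0.975) * se

# Percentile bootstrap interval from 1000 resamples
boot_means <- replicate(1000, mean(sample(RC_shots, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))

print(rbind(wald = wald_ci, bootstrap = unname(boot_ci)))
```

For a simple proportion with about 100 observations the two intervals should nearly coincide; the bootstrap becomes more valuable when no such formula is available.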
