# Regression Discontinuity Design: The Crown Jewel of Causal Inference

Modeling, Statistics, Causal Inference. Posted by Leihua Ye, December 17, 2019

**Background**

In a series of earlier posts, I’ve explained why and how we should run social experiments. However, it’s not always possible to run an experiment, and researchers then have to identify causal effects with observational and quasi-experimental methods.

Regression Discontinuity Design (RDD) is an old quasi-experimental technique that has regained popularity after a long period of dormancy. In this post, I’ll cover its basics, design logic, assumptions, the Potential Outcomes Framework (POF), merits, and an illustration in R.

**What is RDD?**

It is a quasi-experimental method with a pretest-posttest design.

By setting up a “cutoff” point, researchers choose the subjects **slightly** above the threshold line as the treatment group and the ones **slightly** below the line as the control group.

Since these two groups sit very close to each other on the variable that determines treatment, researchers can treat them as comparable on all other confounding variables, so that they differ only in the treatment condition. So, we can attribute any difference in the outcome variable to the presence of the treatment.

This “as-if” randomization component is why RDD is believed to identify causal effects almost as credibly as a Randomized Controlled Trial (RCT), at least for subjects near the cutoff.

In statistical form:

$$
D_i = \begin{cases} 1 & \text{if } X_i \geq c \\ 0 & \text{if } X_i < c \end{cases}
$$

*where*

*D_i: whether the subject has received a treatment*

*X_i: forcing (aka, score or running) variable*

*c: the cutoff point*

In plain language, subject i receives the treatment if its X_i value is at or above the cutoff value.

In a hypothetical example, we are interested in the effect of receiving a merit-based scholarship on students’ academic success. The administration sets a GPA of 3.5 as the cutoff: students whose GPA is 3.5 or above receive the award, and those below 3.5 do not.

To answer this question, we can use an RDD framework for causal inference. First, we choose students who are barely eligible for the scholarship (GPAs from 3.51 to 3.55) as the treatment group. Then, we select those slightly below the cutoff (GPAs from 3.45 to 3.49) as the control group.

Since the marginal differences are negligibly small, we believe students in these two groups aren’t fundamentally different.
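A minimal sketch of this narrow-band comparison in base R. Everything here is an assumption for illustration: the data are simulated, the true scholarship effect is set to 2, and the bands are the continuous analogue of the GPA ranges mentioned above.

```r
# Hypothetical simulated data: all numbers are illustrative assumptions
set.seed(42)
n <- 20000
gpa <- runif(n, 2, 4)                      # running variable
scholarship <- as.integer(gpa >= 3.5)      # sharp assignment rule at the cutoff
# assumed outcome: smooth in GPA, plus a jump of 2 for scholarship holders
success <- 5 + 3 * gpa + 2 * scholarship + rnorm(n)

# keep only students just above and just below the cutoff
treated <- success[gpa >= 3.50 & gpa <= 3.55]
control <- success[gpa >= 3.45 & gpa <  3.50]

# naive local estimate: difference in mean outcomes between the two bands
naive_effect <- mean(treated) - mean(control)
naive_effect
t.test(treated, control)
```

Because the outcome also rises smoothly with GPA, this difference in means picks up a little of that slope on top of the jump. That is one reason to fit regression lines on each side rather than compare raw means.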

Let’s look at the graphical illustration.

*where*

*D_i = 1: people who receive the scholarship*

*D_i = 0: people who do not*

*c: GPA = 3.5*

In the plot, students with the scholarship are shown in blue and those without it in red.

After fitting a regression line on each side of the cutoff, we observe a discontinuity between the two lines at c. Thus, we conclude that the scholarship does make a difference.

The conclusion is: give students more financial support!
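The regression-based comparison can be sketched in base R as follows. This is a hedged illustration on simulated data, not the data behind the plot: the outcome equation and the true jump of 2 are assumptions.

```r
# Hypothetical simulated data with an assumed true jump of 2 at the cutoff
set.seed(1)
cutoff <- 3.5
n <- 2000
gpa <- runif(n, 2, 4)
treated <- gpa >= cutoff
success <- 5 + 3 * gpa + 2 * treated + rnorm(n)

# fit one regression line on each side of the cutoff
fit_below <- lm(success ~ gpa, subset = !treated)
fit_above <- lm(success ~ gpa, subset = treated)

# evaluate both lines at the cutoff; the gap is the estimated discontinuity
at_cutoff <- data.frame(gpa = cutoff)
jump <- unname(predict(fit_above, at_cutoff) - predict(fit_below, at_cutoff))
jump
```

The estimate is the vertical distance between the two fitted lines exactly at the cutoff, which is what the “jump” in an RDD plot represents.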

Let’s put RDD in the Potential Outcomes Framework:

The estimand is the difference of two regression functions at the cutoff point c. In other words, RDD measures the average treatment effect at the cutoff point (local causal effect), not the individual effects.
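Written out, the estimand is usually expressed as below. The notation is an assumption, since the original formula is not reproduced in the text: Y_i(1) and Y_i(0) denote subject i's potential outcomes with and without the treatment, and Y_i is the observed outcome.

$$
\tau_{SRD} = E[Y_i(1) - Y_i(0) \mid X_i = c] = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]
$$

The first expression is the average treatment effect for subjects at the cutoff. The second rewrites it as the gap between the regression function approached from above and from below, which is what can be estimated from data. The equality relies on the regression functions of the potential outcomes being continuous at c.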

There are two assumptions.

*Continuity of Conditional Regression Functions*

E[Y(0)|X=x] and E[Y(1)|X=x] are continuous in x. Also, the conditional distribution function is smooth in the covariate.

This is important because we want to rule out the possibility that other covariates cause the discontinuity (jump in the regression lines) at the cutoff point.
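One common way to probe this in practice is a placebo check: run the same discontinuity regression on a pre-treatment covariate, which should show no jump at the cutoff. The sketch below uses simulated data, and `study_hours` is a hypothetical covariate invented for illustration.

```r
# Hypothetical simulated data: study_hours is an invented pre-treatment covariate
set.seed(7)
cutoff <- 3.5
n <- 5000
gpa <- runif(n, 2, 4)
D <- as.integer(gpa >= cutoff)
# covariate varies smoothly with GPA and has no jump by construction
study_hours <- 10 + 4 * gpa + rnorm(n, sd = 2)

# regress the covariate on the treatment indicator and the centered running variable
placebo <- lm(study_hours ~ D + I(gpa - cutoff))

# coefficient on D estimates the covariate's jump at the cutoff
placebo_jump <- coef(summary(placebo))["D", ]
placebo_jump
```

A coefficient on `D` that is close to zero and statistically insignificant is consistent with the continuity assumption. A clear jump in a covariate would suggest that something other than the treatment changes at the cutoff.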

*Continuity of Conditional Distribution Functions*

$F_{Y(0)|X}(y \mid x)$ and $F_{Y(1)|X}(y \mid x)$ are continuous in x for all y.

**When and Why RDD?**

- A great alternative when randomization is not possible.

True, we should run experiments whenever we can. However, that is not always possible, even for big tech companies (e.g., Netflix also uses quasi-experimental methods).

- Ideal for rule-based questions (e.g., elections, program evaluation).
- Easy to apply in practice, with relatively mild assumptions.
- Strong internal validity and credible causal estimates when applied appropriately.

Let’s test the results in R. For illustration purposes, I’ll use simulated data.

This is the regression model for the above example:

```
# setwd("/Users/andy/desktop")

# generate sample data; cutoff point = 3.5
GPA <- runif(1000, 0, 4)
future_success <- 10 + 2 * GPA + 10 * (GPA >= 3.5) + rnorm(1000)

# install and load the package 'rddtools'
# install.packages("rddtools")
library(rddtools)

data <- rdd_data(future_success, GPA, cutpoint = 3.5)

# plot the dataset
plot(data, col = "red", cex = 0.1, xlab = "GPA", ylab = "future_success")
```

```
# estimate the sharp RDD model
rdd_mod <- rdd_reg_lm(rdd_object = data, slope = "same")
summary(rdd_mod)
```

```
Call:
lm(formula = y ~ ., data = dat_step1, weights = weights)

Residuals:
    Min      1Q  Median      3Q     Max
-3.3156 -0.6794 -0.0116  0.6732  2.9288

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.96270    0.06878  246.63   <2e-16 ***
D           10.13726    0.12352   82.07   <2e-16 ***
x            2.00150    0.03330   60.10   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 997 degrees of freedom
Multiple R-squared:  0.9577,	Adjusted R-squared:  0.9577
F-statistic: 1.13e+04 on 2 and 997 DF,  p-value: < 2.2e-16
```

The estimated effect (the coefficient on D) is 10.14, which is significant at the 0.001 level and close to the true jump of 10 built into the simulation.
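For intuition, the same-slope model appears to be equivalent to an ordinary linear regression of the outcome on a treatment dummy and the running variable centered at the cutoff. That reading is consistent with the intercept near 10 + 2 × 3.5 = 17 in the output. The sketch below is an assumption about how to reproduce the idea in base R, not rddtools' internal code. The seed and the bandwidth `h = 0.5` are arbitrary choices added for illustration.

```r
set.seed(123)                       # seed added only for reproducibility
cutoff <- 3.5
GPA <- runif(1000, 0, 4)
future_success <- 10 + 2 * GPA + 10 * (GPA >= cutoff) + rnorm(1000)

D <- as.integer(GPA >= cutoff)      # treatment indicator
x <- GPA - cutoff                   # running variable centered at the cutoff

# global model with a common slope on both sides of the cutoff
fit_same <- lm(future_success ~ D + x)

# local linear model: separate slopes, using only observations within h of the cutoff
h <- 0.5                            # hypothetical bandwidth, not data-driven
near <- abs(x) <= h
fit_local <- lm(future_success ~ D * x, subset = near)

effect_same <- unname(coef(fit_same)["D"])
effect_local <- unname(coef(fit_local)["D"])
c(same_slope = effect_same, local_linear = effect_local)
```

Restricting the sample to a window around the cutoff makes the estimate less sensitive to the functional form far from the cutoff, at the cost of a larger standard error because fewer observations are used.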

```
# plot the RDD model along with binned observations
plot(rdd_mod, cex = 0.1,
     col = "red",
     xlab = "GPA",
     ylab = "future_success")
```

The plot gives a visual representation of the effect: we can see the “jump” at the cutoff point.
