Regression Discontinuity Design: The Crown Jewel of Causal Inference
Background


In a series of posts (here, here, here, here and here), I’ve explained why and how we should run social experiments. However, running an experiment is not always possible, and researchers often have to identify causal effects with observational and quasi-experimental methods instead.


Although Regression Discontinuity Design (RDD) is an old quasi-experimental technique, it has recently come back into fashion after a long period of dormancy. In this post, I’ll focus on its basics, design ideology, assumptions, Potential Outcomes Framework (POF), merits, and an R illustration.

  1. What is RDD?

It is a quasi-experimental method with a pretest-posttest design.

By setting up a “cutoff” point, researchers choose the subjects slightly above the threshold line as the treatment group and the ones slightly below the line as the control group.

Since these two groups sit very close to each other on the running variable, researchers can plausibly treat them as comparable on all other confounding variables, leaving the treatment condition as the only systematic difference. So, we can attribute any differences in the outcome variable to the presence of the treatment.

This “as-if” randomization near the cutoff is why RDD is believed to support causal inference almost as credibly as a Randomized Controlled Trial (RCT).

In statistical form:

D_i = 1 if X_i >= c
D_i = 0 if X_i < c

D_i: whether subject i has received the treatment

X_i: forcing (aka, score or running) variable

c: the cutoff point

In plain language, subject i receives the treatment if its X_i value is at or above the cutoff value.

In a hypothetical example, we are interested in the effect of receiving a merit-based scholarship on students’ academic success. The administration sets a GPA of 3.5 as the cutoff: students with a GPA of 3.5 or above receive the award, while those below 3.5 do not.

To answer the question, we can rely on an RDD framework for causal inference. First, we choose students who are barely eligible for the scholarship (whose GPAs fall between 3.51 and 3.55) as the treatment group. Then, we select the ones slightly below the cutoff (whose GPAs fall between 3.45 and 3.49) as the control group.

Since the GPA differences between the two groups are negligibly small, we assume students in these two groups aren’t fundamentally different.
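As a rough sketch of this comparison, the R snippet below simulates hypothetical students and contrasts mean outcomes in the two narrow GPA windows. The data-generating process, the effect size of 0.3, and all variable names are assumptions made up for illustration; they do not come from a real dataset.

```r
# Hypothetical simulated data (all numbers and names are assumptions)
set.seed(42)
n <- 20000
gpa <- runif(n, 2, 4)                      # running variable X_i
scholarship <- as.integer(gpa >= 3.5)      # D_i = 1 if X_i >= c, with c = 3.5
outcome <- 1 + 0.5 * gpa + 0.3 * scholarship + rnorm(n, sd = 0.2)

# Keep only the students in the narrow windows on each side of the cutoff
treated <- outcome[gpa >= 3.51 & gpa <= 3.55]
control <- outcome[gpa >= 3.45 & gpa <= 3.49]

# Naive local comparison of means
naive_effect <- mean(treated) - mean(control)
naive_effect
```

The difference in means should land near the assumed effect of 0.3, with a small upward bias because the outcome also rises with GPA inside the windows. That leftover slope is one reason to fit regression lines on both sides of the cutoff rather than compare raw means.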

Let’s look at the graphical illustration.

Plot 1


D_i = 1: people who receive the scholarship

D_i = 0: people who do not

c: GPA = 3.5

Plot 2

From plot 2, we can see students with the scholarship in blue and the ones without the scholarship in red.

Plot 3: POF and LATE

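After fitting regression lines on each side, we observe a discontinuity between the two lines at the cutoff. Thus, we conclude that the scholarship does make a difference.

Let’s put the RDD method in the POF framework, where Y_i(1) and Y_i(0) are the potential outcomes of subject i with and without the treatment:

tau = E[Y_i(1) - Y_i(0) | X_i = c] = lim_{x↓c} E[Y_i | X_i = x] - lim_{x↑c} E[Y_i | X_i = x]

The estimand is the difference of the two regression functions at the cutoff point c. In other words, RDD measures the average treatment effect at the cutoff point (a local causal effect), not individual effects. The second equality relies on the continuity assumptions discussed in the next section.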

2. Assumptions

There are two assumptions.

  • Continuity of Conditional Regression Functions

E[Y(0)|X=x] and E[Y(1)|X=x] are continuous in x, where Y(0) and Y(1) are the potential outcomes without and with the treatment. Also, the conditional distribution function is smooth in the covariate.

This is important because we want to rule out the possibility that other covariates cause the discontinuity (jump in the regression lines) at the cutoff point.

  • Continuity of Conditional Distribution Functions

F_{Y(0)|X}(y|x) and F_{Y(1)|X}(y|x) are continuous in x for all y.
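These continuity assumptions cannot be verified directly, since we never observe both potential outcomes for the same subject. One common indirect check is a placebo test: a covariate determined before the treatment should show no jump at the cutoff. The sketch below uses base R and a made-up covariate (`study_hours`); the bandwidth `h` and the data-generating process are assumptions for illustration only.

```r
# Hypothetical placebo check on a pre-treatment covariate (simulated data)
set.seed(1)
n <- 5000
gpa <- runif(n, 0, 4)
xc <- gpa - 3.5                      # running variable centered at the cutoff
D <- as.integer(xc >= 0)             # treatment indicator

# Assumed covariate: varies smoothly with GPA, with no jump at the cutoff
study_hours <- 5 + 2 * gpa + rnorm(n)

h <- 0.5                             # assumed bandwidth, picked by hand
near <- abs(xc) <= h

# Local linear regression with separate slopes on each side of the cutoff
placebo_fit <- lm(study_hours ~ D * xc, subset = near)
placebo_jump <- unname(coef(placebo_fit)["D"])
summary(placebo_fit)$coefficients["D", ]
```

A coefficient on `D` that is close to zero and statistically insignificant is consistent with the assumptions, although it does not prove them. A clear jump in a pre-treatment covariate would be a warning sign that something other than the treatment changes at the cutoff.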

3. When and Why RDD?

  • A great alternative when randomization is not possible.

True, we should run experiments whenever we can. However, that is not always possible, even for big tech companies (e.g., see this post on why Netflix uses quasi-experiment methods).

  • Ideal for rule-based questions (e.g., elections and program evaluation).
  • Easy to apply in practice with less strict assumptions.
  • Strong internal validity and credible causal estimates if applied appropriately.

Let’s test the results in R. For illustration purposes, I’ll use simulated data.

This is the regression model for the above example:

# generate sample data
# cutoff point = 3.5
GPA <- runif(1000, 0, 4)
future_success <- 10 + 2 * GPA + 10 * (GPA >= 3.5) + rnorm(1000)

# install and load the package 'rddtools'
# install.packages("rddtools")
library(rddtools)
data <- rdd_data(future_success, GPA, cutpoint = 3.5)

# plot the dataset
plot(data,
     col = "red",
     cex = 0.1,
     xlab = "GPA",
     ylab = "future_success")

# estimate the sharp RDD model and print the fitted regression
rdd_mod <- rdd_reg_lm(rdd_object = data, slope = "same")
summary(rdd_mod)

lm(formula = y ~ ., data = dat_step1, weights = weights)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3156 -0.6794 -0.0116  0.6732  2.9288

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.96270    0.06878  246.63   <2e-16 ***
D           10.13726    0.12352   82.07   <2e-16 ***
x            2.00150    0.03330   60.10   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 997 degrees of freedom
Multiple R-squared:  0.9577, Adjusted R-squared:  0.9577 
F-statistic: 1.13e+04 on 2 and 997 DF,  p-value: < 2.2e-16

The estimated effect (the coefficient on D) is about 10.14, close to the true jump of 10 built into the simulation, and it is significant at the 0.001 level.

# plot the RDD model along with binned observations
plot(rdd_mod,
     cex = 0.1,
     col = "red",
     xlab = "GPA",
     ylab = "future_success")

This is a visual representation of the effect as we can see the “jump” at the cutoff point.
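As a cross-check that does not depend on any package, the jump can also be estimated with a local linear regression in base R, allowing a different slope on each side of the cutoff. This is only a sketch: it reuses the simulation formula from above with a fixed seed, and the bandwidth `h = 0.5` is an arbitrary choice rather than a data-driven one.

```r
# Base-R sketch of a sharp RDD estimate (same simulated model as above)
set.seed(123)
GPA <- runif(1000, 0, 4)
future_success <- 10 + 2 * GPA + 10 * (GPA >= 3.5) + rnorm(1000)

cutoff <- 3.5
xc <- GPA - cutoff                   # center the running variable at the cutoff
D <- as.integer(xc >= 0)             # treatment indicator

h <- 0.5                             # assumed bandwidth, picked by hand

# D * xc expands to D + xc + D:xc, so each side gets its own slope;
# the coefficient on D is the estimated jump at the cutoff
local_fit <- lm(future_success ~ D * xc, subset = abs(xc) <= h)
late_hat <- unname(coef(local_fit)["D"])
late_hat
```

Because only observations within `h` of the cutoff are used, the estimate is noisier than the full-sample fit but leans less on the linear functional form far from the cutoff. In practice, it is worth re-running the fit with a few different bandwidths to see whether the estimated jump is stable.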


Originally Posted Here

Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master's degree in Statistics at UC Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include:

1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data
2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control
3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment
4. Causal Graphical Model, User Engagement, Optimization, and Data Visualization
5. Python, R, and SQL

Connect here:

1. http://www.linkedin.com/in/leihuaye
2. https://twitter.com/leihua_ye
3. https://medium.com/@leihua_ye