Causal Inference Using Synthetic Control: The Ultimate Guide

In other posts, I’ve explained what causation is and how to do causal inference using quasi-experimental designs (DID, ITS, RDD). Almost all research methods have to meet two preconditions in order to generate meaningful insights:

1. the treated group looks like the control group (similarity for comparability);

2. a sufficiently large number of observations within each group (a large n).

These two preconditions lay the foundation for causal inference. However, is it possible to do causal inference if we only have one treated case and a few control cases? Even worse, what should we do if no control case has covariates similar to those of the treated case?


In these situations, regression-based solutions (e.g., matching on key variables or propensity score matching) perform poorly. Other quasi-experimental designs, such as the DID method, also require similar covariates between the treated and control groups and would generate a large bias under these two scenarios.

In this post, I proudly present a statistical solution, the Synthetic Control Method (SCM), which was proposed by a group of political scientists like me. Honestly, the SCM has a tremendous amount of causal potential but remains under-appreciated for now. It is beginning to gain traction in industry, as consumer-facing companies want to understand simulated consumer behaviors.


The SCM uses a weighted average of multiple cases from the “donor” pool to create an artificial control case.

Here is a simplified process:

  1. hypothetically, there are J + 1 units;
  2. unit j = 1 is the treated case (attention: only one treated case); units j = 2 to j = J + 1 are unexposed cases that constitute the “donor pool”;
  3. combine the donor units into a weighted average, with one weight per donor;
  4. pick the weight vector, W*, that minimizes a loss function measuring the distance between the treated case and the weighted average of the donors on their pre-treatment characteristics.
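
In the notation of Abadie et al. (2010), this loss is usually written as a weighted distance between the treated unit and the weighted donors. The symbols below follow that paper as I understand it; k (the number of pre-treatment characteristics) is a label introduced here for illustration.

```latex
W^{*} = \arg\min_{W} \; \lVert X_1 - X_0 W \rVert_V
      = \arg\min_{W} \; \sqrt{(X_1 - X_0 W)^{\top} \, V \, (X_1 - X_0 W)}
\quad \text{subject to} \quad w_j \ge 0 \;\; (j = 2, \dots, J+1), \qquad \sum_{j=2}^{J+1} w_j = 1
```

Here X_1 is the k x 1 vector of pre-treatment characteristics of the treated case, X_0 is the k x J matrix of the same characteristics for the donors, W = (w_2, ..., w_{J+1})' is the vector of donor weights, and V is a k x k symmetric, positive semidefinite matrix that reflects the relative importance of each characteristic.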

(Please refer to Abadie et al. 2010 and Abadie et al. 2015 for detailed descriptions.)
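
The steps above can be sketched in a few lines of base R. This is a minimal illustration on made-up numbers, not the Synth package's implementation: the data, the helper names `to_weights` and `scm_loss`, and the choice to fix V to the identity matrix are all assumptions for this example (Synth also searches over V).

```r
# Toy data (made up): 3 predictors (rows), 1 treated unit, 4 donor units (columns).
X1 <- c(2.0, 5.0, 1.0)                      # treated unit's predictor values
X0 <- cbind(d1 = c(1.0, 4.0, 0.5),
            d2 = c(3.0, 6.0, 1.5),
            d3 = c(2.5, 4.5, 2.0),
            d4 = c(0.5, 7.0, 0.2))
V  <- diag(3)                               # assumed: all predictors equally important

# Hypothetical helper: map free parameters to weights that are >= 0 and sum to 1.
to_weights <- function(theta) {
  e <- exp(theta - max(theta))
  e / sum(e)
}

# Hypothetical helper: the loss (X1 - X0 W)' V (X1 - X0 W) for given parameters.
scm_loss <- function(theta) {
  d <- X1 - X0 %*% to_weights(theta)
  as.numeric(t(d) %*% V %*% d)
}

# Minimize the loss, starting from equal weights.
fit    <- optim(par = rep(0, ncol(X0)), fn = scm_loss, method = "BFGS")
w_star <- setNames(to_weights(fit$par), colnames(X0))
round(w_star, 3)
```

The softmax reparametrization is just one convenient way to keep the weights non-negative and summing to 1 while using an unconstrained optimizer; a quadratic programming solver would be another option.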

In summary, the SCM walks us through the process of generating the control case by providing formal criteria and procedures, which is something matching or other regression-based methods fail to achieve.

Another merit of the SCM is its ability to make the synthetic case look like the treated case on key metrics, namely the pre-treatment covariates and other predictors of the post-treatment outcome (Abadie et al. 2010). In other words, the SCM can provide an apples-to-apples comparison.

When to use the SCM?

The SCM is an ideal choice for the following two scenarios:

  1. the social events take place at the aggregate level, e.g. county, state, province;
  2. there is only one treated case and only a few control cases.

Due to these two traits, the SCM is the go-to method for large-scale program evaluation (e.g., California’s Tobacco Control Program; Evaluating Place-Based Crime Interventions). The industry should seriously consider adding it to its data science toolkit.


In total, the SCM has four advantages.

  1. It restricts the weights to lie between 0 and 1, and so avoids extrapolation. Extrapolation happens when the weights are not bounded between 0 and 1, and a weight outside that range is hard to make sense of: how would we interpret a weight of 200%? It has no intuitive meaning.
  2. It lays out the selection criteria and explains the relative importance of each donor.
  3. The synthetic control case closely resembles the treated case before the treatment.
  4. The choice of a synthetic control does not rely on the post-intervention outcomes, which guards against cherry-picking a study design that favors a particular conclusion.
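
The first advantage can be illustrated with a made-up toy case in base R. The numbers are invented so that the treated unit lies outside the range spanned by the two donors; the variable names are only for this sketch.

```r
# Made-up toy data: 2 predictors (rows), 2 donor units (columns).
X0 <- cbind(d1 = c(1, 2), d2 = c(2, 1))
x1 <- c(3, 0)   # treated unit, deliberately outside what the donors span

# Unconstrained weights: solve X0 %*% w = x1 exactly.
w_free <- solve(X0, x1)
w_free          # -1 and 2: a perfect fit, but only via a -100% and a 200% weight

# Constrained weights: d1 gets w in [0, 1], d2 gets 1 - w.
loss  <- function(w) sum((x1 - X0 %*% c(w, 1 - w))^2)
w_d1  <- optimize(loss, interval = c(0, 1))$minimum
w_con <- c(d1 = w_d1, d2 = 1 - w_d1)
round(w_con, 3)
```

With the constraint in place the fit is no longer perfect, but every weight can be read as a share of the synthetic case, which is the interpretability the SCM trades for.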

How to use it?

1. Industry Application

  1. Program Evaluation
  2. Crime Study
  3. Rare Events

  • Potentially, we could apply synthetic control to generate more cases for rare events, since rare events suffer from a shortage of data. Please check my other post on how to classify rare events using 5 Machine Learning classifiers.

2. R Implementation

In this section, I’ll replicate the results from Abadie and Gardeazabal (2003), which examines how terrorism affects economic output in the Basque Country, Spain. We will use the R package ‘Synth’ for our analysis; please refer to “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies” for the detailed math explanations and R instructions.

Step 0: Package, library, and Exploratory Data Analysis

# install (once) and load the package
# install.packages("Synth")
library(Synth)
# load the dataset "basque"
data("basque")
dim(basque) # 774 rows, 17 columns
Table 1

From Table 1, there are 774 observations and 17 variables (columns).

Columns 1–3: region number, name, and year (ID information)

DV: gdpcap (GDP per capita)

Other columns: there are 13 predictor variables.

Step 1: Data Preparations

The original dataset “basque” has a traditional panel format, and we need to reshape it with dataprep() into the form that synth() expects.

# set up the arguments for dataprep()
dataprep.out <- dataprep(
  foo = basque,                        # the panel data frame
  predictors = c("school.illit", "school.prim", "school.med",
                 "school.high", "school.post.high", "invest"),
  predictors.op = "mean",              # the operator applied to the predictors
  time.predictors.prior = 1964:1969,   # pre-treatment years over which predictors are averaged
  special.predictors = list(
    list("gdpcap", 1960:1969, "mean"),
    list("sec.agriculture", seq(1961, 1969, 2), "mean"),
    list("sec.industry", seq(1961, 1969, 2), "mean"),
    list("sec.construction", seq(1961, 1969, 2), "mean"),
    list("sec.services.venta", seq(1961, 1969, 2), "mean"),
    list("popdens", 1969, "mean")),
  dependent = "gdpcap",                # dependent variable
  unit.variable = "regionno",          # column identifying unit numbers
  unit.names.variable = "regionname",  # column identifying unit names
  time.variable = "year",              # column identifying time periods
  treatment.identifier = 17,           # the treated case
  controls.identifier = c(2:16, 18),   # the control cases: regions 2-16 and 18 (1 and 17 are left out)
  time.optimize.ssr = 1960:1969,       # pre-treatment years over which the fit is optimized
  time.plot = 1955:1997)               # the full period to plot, before and after the treatment
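
To make the idea of reshaping a panel more concrete, here is a toy sketch in base R of turning a long unit-year table into a wide outcome matrix, split into a treated column and donor columns. It only illustrates the general shape of the data; it is not the code dataprep() runs, and the data frame `panel` and its values are made up.

```r
# Hypothetical long-format panel: one row per unit-year.
panel <- data.frame(
  unit = rep(c(1, 2, 3), each = 3),
  year = rep(2001:2003, times = 3),
  y    = c(10, 11, 12, 20, 21, 22, 30, 31, 32)
)

# Wide outcome matrix: rows = years, columns = units.
Y <- tapply(panel$y, list(panel$year, panel$unit), FUN = mean)

treated <- "1"                                             # assumed treated unit
Y1 <- Y[, treated, drop = FALSE]                           # treated unit's outcome path
Y0 <- Y[, setdiff(colnames(Y), treated), drop = FALSE]     # donors' outcome paths
dim(Y1)
dim(Y0)
```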

dataprep.out contains, among other elements, four matrices (X1, X0, Z1, Z0) that allow us to derive causal inference.

X1: the predictor values of the treated case

X0: the predictor values of the control cases

Z1: the outcome of the treated case over the pre-treatment period used for optimization (time.optimize.ssr)

Z0: the outcome of the control cases over the same pre-treatment period

Step 2: run synth()

synth.out = synth(data.prep.obj = dataprep.out, method = "BFGS")

The difference between the real Basque region and the synthetic control can be calculated as follows:

gaps = dataprep.out$Y1plot - (dataprep.out$Y0plot
                                     %*% synth.out$solution.w)
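
The arithmetic behind this line can be checked by hand on a toy case. The matrices and weights below are made up for illustration and only mimic the shapes of Y1plot (years x 1), Y0plot (years x donors), and solution.w (donors x 1).

```r
# Made-up outcomes for 3 years (rows): 1 treated unit and 2 donors.
Y1 <- matrix(c(5.0, 5.5, 5.2), ncol = 1,
             dimnames = list(c("t1", "t2", "t3"), "treated"))
Y0 <- cbind(d1 = c(4.0, 5.0, 6.0),
            d2 = c(6.0, 6.0, 6.0))
rownames(Y0) <- rownames(Y1)
w  <- c(d1 = 0.5, d2 = 0.5)      # assumed donor weights; they sum to 1

synthetic <- Y0 %*% w            # weighted average of the donors, year by year
gaps_toy  <- Y1 - synthetic      # treated minus synthetic
gaps_toy
```

A gap near zero means the synthetic case tracks the treated case in that period; a gap that opens up after the treatment is what we read as the estimated effect.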

To present some summary tables:

synth.tables = synth.tab(dataprep.res = dataprep.out,
                         synth.res = synth.out)
names(synth.tables)
# [1] "tab.pred" "tab.v"    "tab.w"    "tab.loss"

Note: synth.tables$tab.pred is a table comparing pre-treatment predictor values for the treated unit, the synthetic control, and all the units in the sample


To be honest, I wasn’t able to reproduce exactly the same result as the original paper. The original code (synth.tables$tab.pred[1:5,]) compares the first 5 covariates between the treated and the synthetic cases and finds them very similar. I extended the code to include all 13 covariates and found that the remaining variables are also quite similar, with a few exceptions.


As noted above, the SCM allows us to check the relative importance of each unit.

synth.tables$tab.w[8:14, ]

As seen, unit number 10, Cataluna, contributes 85.1% of the synthetic case, and unit number 14, Madrid (Comunidad De), contributes the remaining 14.9%. All other control cases make no contribution.

# plot the treated and synthetic paths before and after the treatment
path.plot(synth.res = synth.out, dataprep.res = dataprep.out,
          Ylab = "real per-capita gdp (1986 USD, thousand)", Xlab = "year",
          Ylim = c(0, 12), Legend = c("Basque country",
                                      "synthetic Basque country"),
          Legend.position = "bottomright")
# plot the gap between the treated and synthetic paths
gaps.plot(synth.res = synth.out, dataprep.res = dataprep.out,
          Ylab = "gap in real per-capita GDP (1986 USD, thousand)", Xlab = "year",
          Ylim = c(-1.5, 1.5), Main = NA)

Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master's degree in Statistics at the UC, Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include: 1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data 2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control 3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment 4. Causal Graphical Model, User Engagement, Optimization, and Data Visualization 5. Python, R, and SQL Connect here: 1. http://www.linkedin.com/in/leihuaye 2. https://twitter.com/leihua_ye 3. https://medium.com/@leihua_ye