

Causal Inference Using Synthetic Control: The Ultimate Guide
Modeling, causal inference. Posted by Leihua Ye, March 24, 2020

In other posts, I’ve explained what causation is and how to do causal inference using quasi-experimental designs (DID, ITS, RDD). Almost all of these research methods have to meet two preconditions to generate meaningful insights:
1. the treated group looks like the control group (similarity for comparability);
2. a sufficiently large number of observations within each group (a large n).
These two preconditions lay the foundation for causal inference. However, is it possible to do causal inference if we only have one treated case and a few control cases? Even worse, what should we do if no control case has covariates similar to those of the treated case?
In these situations, regression-based solutions (e.g., matching on key variables, or propensity score matching) perform poorly. Other quasi-experimental designs, such as the DID method, also require similar covariates between the treated and control groups, so they would produce heavily biased estimates under these two scenarios.
In this post, I present a statistical solution, the Synthetic Control Method (SCM), which was proposed by a group of political scientists, my own field. The SCM has a tremendous amount of causal potential but remains under-appreciated for now. It is starting to gain some traction in industry, as consumer-facing companies want to understand simulated consumer behaviors.
Basics
The SCM uses a weighted average of multiple cases from the “donor” pool to create an artificial control case.
Here is a simplified process:
- suppose there are J + 1 units;
- unit 1 is the treated case (note: only one treated case); units 2 through J + 1 are unexposed cases that constitute the “donor pool”;
- pool the donors and take a weighted average of the units;
- pick the weight vector, W*, that minimizes the following loss function:
(Please refer to Abadie et al. 2010 and Abadie et al. 2015 for detailed descriptions.)
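For readers who want the formula at hand, the loss in Abadie et al. (2010) is usually written as below. The symbols are that paper's notation rather than anything defined in this post: X_1 is the vector of pre-treatment characteristics of the treated unit, X_0 is the matrix of the same characteristics for the J donor units, and V is a diagonal, positive semidefinite matrix that weights the predictors by importance.

```latex
W^{*} = \arg\min_{W} \; \lVert X_1 - X_0 W \rVert_{V}
      = \arg\min_{W} \; \sqrt{(X_1 - X_0 W)^{\top} \, V \, (X_1 - X_0 W)}
\quad \text{subject to} \quad
w_j \ge 0 \;\; (j = 2, \dots, J+1), \qquad \sum_{j=2}^{J+1} w_j = 1 .
```

The two constraints are what keep the synthetic control a weighted average of the donors: every weight is non-negative and the weights sum to one.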
In summary, the SCM walks us through the process of generating the control case by providing formal criteria and procedures, which is something matching or other regression-based methods fail to achieve.
Another merit of the SCM is its ability to make the synthetic case look like the treated case in terms of pre-intervention covariates and other predictors of the post-intervention outcome (Abadie et al. 2010). In other words, the SCM can provide an apples-to-apples comparison.
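To make the weight search above concrete, here is a minimal base-R sketch. It is not how the Synth package works internally. The toy numbers are hypothetical, the predictor-importance matrix V is assumed to be the identity, and the softmax trick is just one convenient way to keep the weights non-negative and summing to one.

```r
# Hypothetical toy data: 3 predictors, 1 treated unit, 4 donor units.
# The treated unit is built as an equal mix of donors 1 and 2.
X1 <- c(10, 5, 2)                       # treated unit's predictors
X0 <- cbind(c(12, 6, 1),                # donor 1
            c(8, 4, 3),                 # donor 2
            c(20, 1, 9),                # donor 3
            c(9, 5, 2.5))               # donor 4

# Map unconstrained parameters to weights that are >= 0 and sum to 1
softmax <- function(theta) {
  e <- exp(theta - max(theta))
  e / sum(e)
}

# Squared distance between the treated unit and the weighted donor average
# (assumption: V is the identity, so all predictors count equally)
loss <- function(theta) {
  w <- softmax(theta)
  sum((X1 - X0 %*% w)^2)
}

# Start from equal weights and let optim() search for better ones
fit <- optim(rep(0, ncol(X0)), loss, method = "BFGS")
w <- softmax(fit$par)

round(w, 3)   # estimated donor weights
fit$value     # remaining distance between treated and synthetic unit
```

Because the treated unit here is an equal blend of the first two donors, a good solution should put most of the weight on those two units.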
When to use the SCM?
The SCM is an ideal choice for the following two scenarios:
- the events of interest take place at an aggregate level, e.g. county, state, province;
- there is only one treated case and a few control cases.
Due to these two traits, the SCM is the go-to method for large-scale program evaluation (e.g., California’s Tobacco Control Program, Evaluating Place-Based Crime Interventions). The industry should seriously consider adding it to the DS toolkit.
Advantages
In total, the SCM has four advantages.
- It assigns weights between 0 and 1 and so avoids extrapolation. Extrapolation happens when the weights are not bounded between 0 and 1, and a weight outside that range is hard to interpret. What would a weight of 200% even mean?
- It lays out the selection criteria and explains the relative importance of each donor.
- The synthetic control case closely resembles the treated case on the pre-treatment characteristics.
- The choice of a synthetic control does not rely on the post-intervention outcomes, which makes it impossible to cherry-pick a study design that may affect the conclusions.
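The extrapolation point in the first advantage is easier to see with numbers. The sketch below uses a made-up example with two donors and two predictors in which the treated unit lies outside the range spanned by the donors. It compares unbounded weights with weights restricted to the 0 to 1 range.

```r
# Hypothetical toy data: 2 predictors, 2 donors, 1 treated unit
X0 <- cbind(donor_a = c(1, 2), donor_b = c(2, 1))
X1 <- c(3, 0)   # treated unit sits outside the range of the donors

# Without bounds, we can match the treated unit exactly,
# but only by extrapolating beyond the donors
w_unconstrained <- solve(X0, X1)
w_unconstrained   # -1 and 2 (up to floating-point error): -100% and 200%

# With the SCM restriction (weights in [0, 1] that sum to 1),
# the two weights are a and 1 - a, so we search over a single number
loss <- function(a) sum((X1 - X0 %*% c(a, 1 - a))^2)
a_hat <- optimize(loss, interval = c(0, 1))$minimum
w_constrained <- c(donor_a = a_hat, donor_b = 1 - a_hat)
w_constrained     # close to 0 and 1: all weight on the nearest donor
```

The bounded solution fits the treated unit less well, but every weight can be read as a share of the synthetic case, which is the trade-off the SCM makes on purpose.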
How to use it?
1. Industry Application
Program Evaluation
- Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program
- The Economic Costs of Conflict: A Case Study of the Basque Country
Crime Study
- A Synthetic Control Approach to Evaluating Place-Based Crime Interventions
- Impact of drought on crime in California: A synthetic control approach
Rare Events
- Potentially, we could apply synthetic control to generate more cases for rare events since rare events lack a supply of data. Please check my other post on how to classify rare events using 5 Machine Learning classifiers.
2. R Implementation
In this section, I’ll replicate the results from Abadie and Gardeazabal (2003), which examines how terrorism affects economic output in the Basque Country, Spain. We will use the R package ‘Synth’ for our analysis; please refer to Synth: An R package for synthetic control methods in comparative case studies for the detailed math explanations and R instructions.
Step 0: Package, library, and Exploratory Data Analysis
# install and load the package
install.packages("Synth")
library(Synth)

# read the dataset "basque"
data("basque")

# EDA
dim(basque)      # 774 * 17
basque[1:10, ]
From Table 1, there are 774 observations and 17 variables (columns).
Columns 1–3: region number, name, and year (ID information)
DV: gdpcap (GDP per capita)
Other columns: there are 13 predictor variables.
Step 1: Data Preparations
The original dataset “basque” has a traditional panel format, and we need to reshape it with dataprep() before we can use synth().
# set up the arguments
# foo: the panel dataset
dataprep.out <- dataprep(
  foo = basque,
  predictors = c("school.illit", "school.prim", "school.med",
                 "school.high", "school.post.high", "invest"),
  predictors.op = "mean",              # the operator applied to the predictors
  time.predictors.prior = 1964:1969,   # pre-treatment years over which predictors are averaged
  special.predictors = list(           # predictors with their own years and operator
    list("gdpcap", 1960:1969, "mean"),
    list("sec.agriculture", seq(1961, 1969, 2), "mean"),
    list("sec.energy", seq(1961, 1969, 2), "mean"),
    list("sec.industry", seq(1961, 1969, 2), "mean"),
    list("sec.construction", seq(1961, 1969, 2), "mean"),
    list("sec.services.venta", seq(1961, 1969, 2), "mean"),
    list("sec.services.nonventa", seq(1961, 1969, 2), "mean"),
    list("popdens", 1969, "mean")),
  dependent = "gdpcap",                # dv
  unit.variable = "regionno",          # identifying unit numbers
  unit.names.variable = "regionname",  # identifying unit names
  time.variable = "year",              # time periods
  treatment.identifier = 17,           # the treated case
  controls.identifier = c(2:16, 18),   # the control cases; all others except number 17
  time.optimize.ssr = 1960:1969,       # the time period over which to optimize
  time.plot = 1955:1997)               # the entire time period before/after the treatment
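dataprep.out contains four key matrices (X1, X0, Z1, Z0) that synth() uses for the estimation.
X1: the predictor values of the treated case before the treatment
X0: the predictor values of the control cases before the treatment
Z1: the outcome values of the treated case over the pre-treatment optimization period
Z0: the outcome values of the control cases over the same period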
Step 2: run synth()
synth.out = synth(data.prep.obj = dataprep.out, method = "BFGS")
The difference between the real Basque region and the synthetic control can be calculated as follows:
gaps = dataprep.out$Y1plot - (dataprep.out$Y0plot
%*% synth.out$solution.w)
gaps[1:3, 1]
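If the matrix product in the gap calculation looks opaque, the toy example below shows the same arithmetic with made-up numbers and no dependency on Synth. The objects Y1, Y0 and w are hypothetical stand-ins for dataprep.out$Y1plot, dataprep.out$Y0plot and synth.out$solution.w, and 1970 is an assumed treatment year for the toy data only.

```r
# Hypothetical outcome series for the treated unit (4 years x 1 column)
Y1 <- matrix(c(5.0, 5.2, 5.1, 4.6), ncol = 1,
             dimnames = list(c("1968", "1969", "1970", "1971"), "treated"))

# Hypothetical outcome series for two donors (4 years x 2 columns)
Y0 <- matrix(c(5.2, 5.4, 5.6, 5.8,
               4.2, 4.4, 4.6, 4.8), ncol = 2,
             dimnames = list(rownames(Y1), c("donor_a", "donor_b")))

# Donor weights: non-negative and summing to 1
w <- matrix(c(0.8, 0.2), ncol = 1)

# Each year's synthetic outcome is the weighted average of the donors
synthetic <- Y0 %*% w

# Gap = treated minus synthetic, one value per year
gaps <- Y1 - synthetic

# Average gap after the (assumed) treatment year
post <- as.integer(rownames(gaps)) >= 1970
mean(gaps[post, 1])   # about -0.65
```

In this toy setup the gaps are zero before 1970 because the weights reproduce the treated series exactly, so the post-treatment gap is what we would read as the estimated effect.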
To present some summary tables,
synth.tables = synth.tab(dataprep.res = dataprep.out,
synth.res = synth.out)
names(synth.tables)
[1] "tab.pred" "tab.v" "tab.w" "tab.loss"
Note: synth.tables$tab.pred is a table comparing pre-treatment predictor values for the treated unit, the synthetic control, and the average across all units in the sample.
synth.tables$tab.pred[1:13,]
To be honest, I wasn’t able to generate the same result as the original paper. The original code (synth.tables$tab.pred[1:5,]) looks at the first 5 covariates for the treated and the synthetic cases and finds that they are very similar. I therefore extended the code to include 13 covariates and found that the remaining variables are also quite similar, with a few exceptions.
As noted above, the SCM allows us to check the relative importance of each unit.
synth.tables$tab.w[8:14, ]
As seen, unit number 10, Cataluna, contributes 85.1% to the synthetic case, and unit number 14, Madrid (Comunidad De), contributes the remaining 14.9%. All other control cases make no contribution.
# plot the changes before and after the treatment
path.plot(synth.res=synth.out,dataprep.res = dataprep.out,
Ylab="real per-capita gdp (1986 USD, thousand)",Xlab="year",
Ylim = c(0,12),Legend = c("Basque country",
"synthetic Basque country"),
Legend.position = "bottomright")
gaps.plot(synth.res = synth.out, dataprep.res = dataprep.out,
Ylab = "gap in real per-capita GDP (1986 USD, thousand)", Xlab = "year",
Ylim = c(-1.5,1.5), Main = NA)
Reference and Further Resources
- Synth: An R package for synthetic control methods in comparative case studies
- Package ‘Synth’
- MicroSynth: A Tutorial
- Abadie and Gardeazabal 2003. The economic costs of conflict
- Abadie et al. 2010. Synthetic control methods for comparative case studies
- Abadie et al. 2015. Comparative politics and the synthetic control method