

The Turf War Between Causality and Correlation In Data Science: Which One Is More Important?
Posted by Leihua Ye, January 6, 2020. Tags: Modeling, Statistics, Causality, Correlation

Data scientists have long tried to differentiate causality from correlation. In the last month alone, I’ve seen 20+ posts referencing the catchphrase “correlation is not causality.” What they actually want to say is that correlation is not as good as causality.
This bias towards causality in the data world is understandable. Causal research takes more training (e.g., the potential outcomes framework, hypothesis testing, counterfactuals) than correlational research.
On a personal note, I judge someone’s work by the strength of their causal story.
Causal inference generates actionable insights about end users and points the product team in the right direction.
However, this is no reason to take correlational studies lightly. Plenty of business scenarios require both types of research.
Fact Checks: What is correlation? What is causality?
- What Is Correlation?
In its simplest form, correlation means Events A and B happen together, with no causal claim attached. That is, we don’t know whether A causes B or the other way around.
For example, an online booking agency spent 1 million on a new web design that went live on November 10th, and website traffic surged one week later.
The designers claim the credit and attribute the traffic surge to the fresh look they developed.
Playing it safe, the data scientists offer three alternative explanations.
#1 Increased spending on digital marketing over the last three quarters finally pays off.
#2 Improved macroeconomic conditions boost customers’ willingness to travel.
#3 Right timing. It’s mid-November, and customers start planning family trips for Christmas.
None, some, or all of these alternatives could be true. Unfortunately, a correlational study only tells us how strongly these events are related; it can’t identify the source of causality.
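To make "how strongly these events are related" concrete, here is a minimal sketch of the Pearson correlation coefficient. The weekly marketing-spend and traffic numbers are made up for illustration and are not from the booking agency example; `pearson_r` is just a helper name I chose.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly figures (in thousands), made up for illustration only.
marketing_spend = [10, 12, 15, 18, 20, 25]
site_visits = [200, 230, 260, 310, 330, 400]

r = pearson_r(marketing_spend, site_visits)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 says the two series move together. It says nothing about whether spending drives traffic, traffic drives spending, or a third factor drives both.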
To tell these explanations apart, we need causal inference.
- What Is Causality?
Causality means Events A and B occur together and either A → B, B → A, or A ↔ B (two-way causation).
In the previous example, we need to eliminate the alternative hypotheses before estimating the effects of the new web design.
Great job, UX designers! Your data science colleagues just proved you’re statistically significant.
To rule out the alternatives, researchers can follow two paths: experimental and observational designs.
Following the first path, researchers design a field experiment to test the effects of an intervention (a.k.a. a treatment). If a full-fledged experiment isn’t an option (e.g., because of time constraints), we settle for a quasi-experimental design.
In the web example, we can run an A/B test. From historical records, we find that the booking agency’s Californian customers exhibit behavioral patterns similar to those of its customers in New York. So, researchers randomly choose California as the treatment group and New York as the control group and follow these steps:
1. Customers with a Californian IP address are directed to the new web design when they visit the website;
2. Customers with a New York IP address are directed to the old web version;
3. Check whether there is any statistically significant difference between these two groups.
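The comparison in step 3 can be sketched as a two-proportion z-test. This is only one common choice. It assumes the outcome is a booking conversion rate and that visitors behave independently. The visitor and booking counts below are hypothetical, and `two_proportion_z_test` is a helper name I made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / n_a: conversions and visitors in the treatment group.
    conv_b / n_b: conversions and visitors in the control group.
    Returns (z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, made up for illustration only.
z, p_value = two_proportion_z_test(
    conv_a=600, n_a=5000,   # California, new design
    conv_b=500, n_b=5000,   # New York, old design
)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the gap between the two groups is unlikely to be noise alone. It does not by itself rule out regional differences between California and New York, which is why the historical similarity check matters.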
This is the fundamental idea behind A/B testing, and other variants build on it. The devil is in the details: the success of an A/B test depends on them, and I’m going to write another post on this topic. Stay tuned!
The downsides of experimentation should also be taken into consideration. Collecting and analyzing experimental data takes a long time, which makes experiments less desirable when we need a swift answer.
As a tradeoff, data scientists take the second path and analyze existing observational data (e.g., user logs, surveys). Frequently used statistical tools for causal inference with observational data include matching, propensity score matching, Difference-in-Differences, and Regression Discontinuity Design.
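Of the tools listed above, Difference-in-Differences is the easiest to sketch. The idea is to compare the before/after change in a group that received the intervention with the before/after change in a group that did not. The sketch below uses the simplest two-group, two-period version with made-up daily booking counts. It rests on the "parallel trends" assumption: without the intervention, both groups would have changed by the same amount.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period Difference-in-Differences estimate.

    Each argument is a list of outcome observations (e.g., daily bookings).
    Returns (change in treated group) minus (change in control group).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical daily bookings, made up for illustration only.
treated_pre = [100, 102, 98, 100]    # region with new design, before launch
treated_post = [120, 118, 122, 120]  # region with new design, after launch
control_pre = [90, 92, 88, 90]       # region with old design, before launch
control_post = [100, 98, 102, 100]   # region with old design, after launch

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(f"Estimated effect of the new design: {effect} bookings per day")
```

Here the treated region improves by 20 bookings a day and the control region by 10, so only the remaining 10 are attributed to the new design. The control group's change stands in for seasonality and other shared trends.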
Observational design is considered a last resort. The researcher has no control over how the data are generated, and the methods rest on additional statistical assumptions. As a result, observational methods can produce imprecise, and sometimes wrong, estimates.
For example, researchers compared the performance of experimental and non-experimental methods in estimating advertising effects on Facebook and found that the observational methods performed poorly.
If possible, I’d recommend that data scientists take the following steps:
- Run a mini-experiment for a short duration to produce some preliminary results.
- Keep the experiment going for a long time.
- Monitor for updates: is anything new, different, or still the same?
By doing so, we get some immediately usable, though imperfect, findings to support the product team, and we can recalibrate later if anything new pops up.
This is probably the biggest DIFFERENCE between industry and academia:
it’s OK to take a long time on research for academic purposes, but businesses need quick solutions.
On a side note: tech companies (e.g., Facebook, Netflix, Airbnb) have increasingly adopted experimental designs to generate new content and promote online marketing.
Causality And Correlation: Which One Is More Important? They Are Equally Important.
Why Causality?
Causal inference brings benefits for today and tomorrow.
For today, causal research shows how users engage with our products and quantifies the effects of that engagement, producing actionable insights.
For tomorrow, people change their behavior as they engage with the product, so companies need to adapt alongside them. Long-term causal work helps us track such changes and predict future trends.
Why Correlation?
Correlational research applies to a broader range of business scenarios because it requires fewer “picky” statistical assumptions.
For example, big retail companies arrange their store layouts to put related products together. As far as I know, Target, Walmart, and Costco rearrange their store layouts based on association analysis.
You may have heard of the Diaper-Beer syndrome: new dads grab a cold one on their way out of the store after shopping for diapers for their newborns. So, businesses put Pampers and Bud Light near each other to increase sales.
Honestly, shopping is way too heavy-duty for men.
The Diaper-Beer syndrome is a business scenario in which we care more about WHAT products sell together and less about WHY.
Causal research is expensive and time-consuming. Some items may be correlated for good reasons but more often for no reason. They simply are, and it’s OK not to know why. In such cases, a strong correlation is good enough.
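For the curious, here is a minimal sketch of how "what sells together" is usually scored in association analysis, using support, confidence, and lift. The shopping baskets are made up for illustration, and `association_metrics` is a hypothetical helper, not any retailer's actual method.

```python
def association_metrics(baskets, item_a, item_b):
    """Support, confidence, and lift for the rule item_a -> item_b.

    baskets: list of sets, each set holding the items in one transaction.
    """
    n = len(baskets)
    count_a = sum(1 for basket in baskets if item_a in basket)
    count_b = sum(1 for basket in baskets if item_b in basket)
    count_ab = sum(1 for basket in baskets
                   if item_a in basket and item_b in basket)
    if count_a == 0 or count_b == 0:
        raise ValueError("both items must appear in at least one basket")
    support = count_ab / n               # share of baskets with both items
    confidence = count_ab / count_a      # share of item_a baskets with item_b
    lift = confidence / (count_b / n)    # > 1: bought together more than chance
    return support, confidence, lift

# Hypothetical transactions, made up for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
    {"milk", "bread", "chips"},
]

support, confidence, lift = association_metrics(baskets, "diapers", "beer")
print(f"support={support:.3f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 means the two items show up in the same basket more often than they would if purchases were independent. That is enough to justify a shelf change without ever asking why.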
Things are related for a reason.
Things are related for no reason at all.
When and How To Use?
For Causality
- Why do customers browse the Walmart product catalog but never finish the transaction?
- How would the new web design affect customer retention and satisfaction?
- Why do users disengage from the product?
- Why do customers in emerging markets shop only offline and never online?
- All other questions about whys and hows.
For Correlation
- What other products sell together besides Pampers and Bud Light?
- Where should the food court go in a Costco store?
- Where should we open another Starbucks or another Amazon warehouse?
- Life science
Doctors don’t fully understand how certain types of diseases develop and have to rely on the associated signs and symptoms to diagnose them.
- Personalized recommendation system
Amazon adopts an item-to-item collaborative filtering system. It analyzes customers’ past browsing and purchasing history and recommends associated merchandise.
- For the millions of other questions that don’t require knowing the whys and hows, a correlational design is preferred.
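The item-to-item collaborative filtering idea mentioned in the list above can be sketched in a few lines. This is a toy version under my own assumptions, not Amazon's actual system: items are compared by the cosine similarity of their sets of buyers, and the purchase histories, `item_similarities`, and `recommend` are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def item_similarities(purchases):
    """Cosine similarity between items, based on which users bought them.

    purchases: dict mapping a user id to the set of items that user bought.
    Returns a dict mapping (item, other_item) to a similarity in (0, 1].
    """
    item_users = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            item_users[item].add(user)
    sims = {}
    for a, b in combinations(sorted(item_users), 2):
        overlap = len(item_users[a] & item_users[b])
        if overlap:
            # Cosine similarity for binary (bought / not bought) vectors.
            sim = overlap / math.sqrt(len(item_users[a]) * len(item_users[b]))
            sims[(a, b)] = sims[(b, a)] = sim
    return sims

def recommend(item, sims, top_n=2):
    """Items most similar to `item`, best first (ties broken by name)."""
    scored = [(other, s) for (src, other), s in sims.items() if src == item]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in scored[:top_n]]

# Hypothetical purchase histories, made up for illustration only.
purchases = {
    "u1": {"tent", "sleeping bag", "stove"},
    "u2": {"tent", "sleeping bag"},
    "u3": {"tent", "stove"},
    "u4": {"sleeping bag", "lantern"},
    "u5": {"novel"},
}

sims = item_similarities(purchases)
print(recommend("tent", sims))
```

Nothing here asks why tent buyers also buy stoves. The recommendation rests purely on co-occurrence, which is exactly the kind of question where correlation is good enough.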
Originally Posted Here