Why Causation Matters in Data Science
Modeling | Posted by ODSC Community, August 18, 2020
Why does causation in data science matter? Inferring causality is vital to deriving actionable insights in product data science, just as it is in more established fields like public policy. Without understanding causal impact, we cannot make product changes that alter outcomes or behaviors in line with product or policy goals. In my experience, because of an overreliance on predictive algorithms, many data scientists miss key causal relationships in consumer data. Prediction is useful in some domains. But in consumer analytics, prediction often becomes a proxy for causal relationships, resulting in misleading or incorrect inferences.
In consumer analytics, we generally want to understand why customers are engaging, returning, purchasing, downloading, liking, or participating in some relevant behavior. Beyond understanding why, we want to either ‘nudge’ or change behavior.
The Thaler nudge is when we build products that encourage consumers to engage in a desired behavior by making that behavior the default and forcing consumers to opt out. While nudging works for small behavior changes, it fails for more impactful ones. To change consumer behavior, we generally need to understand why consumers behave the way they do and what is driving that behavior. This is why we must rely on causal inference and behavioral psychology to move the needle. If you are trying to push major behavior change, you need to find the triggers of the behavior, understand the motivators, and make the behavior change easy, predictable, and self-actualizable. Behavioral change is undoubtedly hard, and even if you are very successful, expect to see fewer than 10% of your user population making a desired change.
While A/B tests are used to determine causal effects for simple product changes, they are often not used to understand larger consumer behavioral processes. Many data scientists rely on A/B tests to answer one-off questions, rather than developing deep and nuanced theories about consumer behavior and then using A/B tests to test those theories. Even so, data science remains focused on predictive algorithms: regression and classification. While prediction gets better with more data, causal inference, which is design-driven, generally does not.
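As a rough illustration of why a randomized A/B test supports a causal reading, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the simulated users, the hypothetical `engagement` trait, the baseline conversion rate, and the two-point `true_lift` are invented, not drawn from any real product.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical user trait that affects conversion (e.g. prior engagement).
engagement = rng.normal(size=n)

# Random assignment: treatment is independent of engagement by design.
treated = rng.integers(0, 2, size=n)

# Assumed data-generating process: the product change adds 2 points of conversion.
true_lift = 0.02
p_convert = np.clip(0.10 + 0.03 * engagement + true_lift * treated, 0, 1)
converted = rng.random(n) < p_convert

# With random assignment, the difference in conversion rates between arms
# estimates the average causal effect of the change.
rate_t = converted[treated == 1].mean()
rate_c = converted[treated == 0].mean()
lift = rate_t - rate_c

# Standard error of a difference between two independent proportions.
se = np.sqrt(rate_t * (1 - rate_t) / (treated == 1).sum()
             + rate_c * (1 - rate_c) / (treated == 0).sum())
print(f"estimated lift: {lift:.4f} +/- {1.96 * se:.4f}")
```

The estimate is trustworthy because of the design (random assignment), not because of the sample size; the same calculation on self-selected users could be badly biased no matter how much data is collected.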
Never in history have we been able to collect such massive amounts of precisely measured behavioral data. Justifiably, prediction is seeing a renaissance. As this behavioral data grows, the use of predictive approaches to answer causal inference questions is burgeoning. In consumer analytics, this growth is especially pernicious because predictive methods do not tell us why consumers behave as they do.
The example of longevity gets at the crucial difference between prediction and causal inference. Predicting how long you will live, while it might be interesting and useful for actuarial purposes, will not tell you what to change to live longer. This is because variables can be correlated, and even be good predictors, without being causally related.
Predicting how long you will live is also much easier than understanding all of the causes of why you live for 83.6 years. That’s the crux of the difference between prediction and causal inference. Causation, though more difficult to pin down, is actionable: when we understand why something is happening, we can often change it. For instance, if drinking red wine three times a week leads people on average to live three years longer, then we can all drink red wine three times a week to increase our average lifespans. Prediction is about forecasting the future based on the current world. We can predict longevity, but prediction alone never tells us why we lived that long.
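To make the distinction concrete, here is a minimal simulation sketch in Python. It assumes a made-up world, not data from any study: a hypothetical confounder, `wealth`, drives both wine drinking and lifespan, while wine itself is given zero causal effect. The variable names and coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: wealth drives both wine drinking and longevity.
# In this toy world, wine has NO causal effect on lifespan.
wealth = rng.normal(size=n)
drinks_wine = (wealth + rng.normal(size=n) > 0).astype(float)
lifespan = 80 + 3 * wealth + rng.normal(size=n)

# Predictive view: wine drinkers live longer on average, so wine drinking
# is a useful predictor of lifespan.
naive_gap = lifespan[drinks_wine == 1].mean() - lifespan[drinks_wine == 0].mean()

# Causal view: regress lifespan on wine while adjusting for the confounder.
X = np.column_stack([np.ones(n), drinks_wine, wealth])
coef, *_ = np.linalg.lstsq(X, lifespan, rcond=None)
adjusted_effect = coef[1]

print(f"naive gap between drinkers and non-drinkers: {naive_gap:.2f} years")
print(f"effect of wine after adjusting for wealth:   {adjusted_effect:.2f} years")
```

In this sketch the naive gap is several years, yet the adjusted effect is close to zero: telling everyone to drink wine would change nothing. The adjustment only works here because the confounder is known and measured, which is rarely guaranteed with real consumer data.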
Most human behavior, even in web applications, is social. Social processes and human behavior in general are rarely simple: thousands, and potentially millions, of causal factors might influence a behavioral outcome. Causal chains can be long, with many intervening variables. How long you live depends on genetic factors, the environment (from infancy through old age), relationships, attitudes, and how all these complex factors interact. Social processes, like human behavior, do not reduce to a simple problem, in that there is generally no simple cause-and-effect relationship between variables. Most social processes are defined by ‘mutual interaction’, where the variables can drive each other intermittently over a long temporal process.
Social processes, which characterize many web products, are in essence not well understood through prediction, because there is often no clear outcome to predict, only a series of interconnected variables. Determining which variables have a large causal impact on a process, and which have a minor causal impact or are merely correlated, becomes of core importance to changing these processes.
While some questions and areas lend themselves to predictive approaches, prediction does not tend to yield actionable insights and prescriptive changes. The future of product data science is not a choice between prediction and causal inference, but the integration of predictive tools to further causal inference methods, such as statistical matching and uplift modeling, to better understand causal relationships and chains of human behavior.
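As one hedged illustration of predictive tools serving a causal question, the sketch below shows the two-model approach to uplift modeling (sometimes called a T-learner) on simulated data. It is a toy under stated assumptions, not the author's method: the single feature `x`, the randomized nudge, the assumed true uplift of `1 + 2 * x`, and the helper functions `fit_linear` and `predict` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical setup: one user feature x, a randomly assigned nudge, and an
# outcome whose treatment effect grows with x (assumed true uplift = 1 + 2 * x).
x = rng.uniform(-1, 1, size=n)
treated = rng.integers(0, 2, size=n).astype(bool)
y = 0.5 * x + treated * (1 + 2 * x) + rng.normal(scale=0.5, size=n)

def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns [intercept, slope]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict(coef, features):
    """Evaluate the fitted line at the given feature values."""
    return coef[0] + coef[1] * features

# Two-model approach: fit one predictive model per experiment arm.
coef_t = fit_linear(x[treated], y[treated])
coef_c = fit_linear(x[~treated], y[~treated])

# Predicted uplift is the gap between the two models' predictions.
grid = np.array([-0.5, 0.0, 0.5])
uplift = predict(coef_t, grid) - predict(coef_c, grid)
for value, estimate in zip(grid, uplift):
    print(f"x = {value:+.1f}: estimated uplift {estimate:.2f}")
```

Each model is purely predictive, but because treatment was randomized, the difference between their predictions can be read as an estimate of who responds most to the nudge. Without randomization (or a credible adjustment such as matching), the same gap would mix causal effect with selection.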
Joanne Rodrigues is an experienced data scientist and enterprise manager with master’s degrees in mathematics (London School of Economics), political science (U.C. Berkeley) and demography (U.C. Berkeley) and a bachelor’s degree in international economics (Georgetown University). Her passion is to analyze large sets of structured, semi-structured, and unstructured data to solve real-world problems. She has six years of experience applying machine learning/statistical algorithms to derive business insights (in healthcare, gaming). She pioneered new analytics techniques at Sony Playstation and led all of MeYou Health’s data science efforts. She is the author of Product Analytics: Applied Data Science techniques for Actionable Insight, forthcoming August 2020 for Addison-Wesley Professional and the founder of the health technology company, Clinic Price Check.