Erich is a speaker for ODSC East 2020 this April 13-17! Be sure to check out his talk, “Methods for Using Observational Data to Answer Causal Questions,” there!
Is drinking red wine associated with decreased risk of heart disease? Does drinking red wine prevent heart disease? Should I advise patients to drink some red wine with dinner every night, in order to lower their risk of heart disease?
Causal analysis covers many topics, but can be briefly summarized as data analysis methods used for answering causal questions. Of the above questions, it may be obvious that the first is not distinctly causal, per the oft-repeated mantra “correlation is not causation.” The second question is obviously a causal question, as prevention is a type of causal relation: to prevent something is to actively cause it to not happen. The third question may be less obvious, but it is also one of the most important and prevalent types of research questions, both in health and in other domains, namely “Should I do that thing?” or “What will happen if I do this thing instead of that thing?” It indirectly asks for causal information, namely the consequences of certain actions or interventions, and so if we’re feeling generous, we can call it a causal question.
Predictive data analysis is highly effective for answering predictive questions in many domains, including science, entertainment, business, and health. Many of our real-world problems are fundamentally causal in nature, however, and causal analysis of observational data is much more controversial. The style guide for publications in the Journal of the American Medical Association, for example, explicitly states that you may not use causal language, or even language that is suggestive of causality, unless you are reporting a randomized controlled trial. Even meta-analyses of randomized controlled trials must frame their findings as “associations” instead of estimates of causal effects. When we consider where causal analysis fits in the larger machine learning ecosystem, we can perhaps see why it is so controversial.
Machine learning methods are generally classified as supervised or unsupervised. Supervised methods require that you give them a collection of problems with correct answers to use as training data, while unsupervised methods operate without ever being given verified correct answers to the kinds of problems we want them to solve. Classification and regression are typically supervised with a training data set, and we can get unbiased estimates of their performance using holdout samples and other techniques. Unsupervised methods like clustering are not amenable to that kind of validation, since there may be no “correct” answer, but they may still yield information that is useful or valuable to the user.
Like clustering, causal analysis is unsupervised. But like classification and regression, its answers can be closer to or further from the truth. In this way, problems such as estimating the total causal effect of one variable on another with observational data are unsupervised regression problems. In principle, our causal analysis could be right, or it could be wrong, but how can we know? Because causal analysis typically does not have access to labeled training data, we can’t leverage the very powerful domain-agnostic tools that have made prediction, and supervised learning generally, so effective. There is no silver bullet for causal analysis.
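To make the validation problem concrete, here is a minimal simulation sketch. It is an illustration only: every variable name and coefficient below is an assumption chosen for the example, not something taken from a real study. Because the data are simulated, we know the true causal effect of x on y is zero. Even so, a regression of y on x predicts held-out data reasonably well and reports a clearly nonzero slope, so the holdout score says nothing about whether the causal answer is right.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
TRUE_EFFECT = 0.0  # by construction, x has no causal effect on y

# Hypothetical data-generating process: z is a common cause of x and y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = TRUE_EFFECT * x + 2.0 * z + rng.normal(size=n)

train, test = slice(0, n // 2), slice(n // 2, n)


def ols(features, target):
    """Least-squares fit; returns [intercept, coefficient_1, ...]."""
    design = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Naive analysis: regress y on x alone and validate on the holdout half.
naive = ols([x[train]], y[train])
pred = naive[0] + naive[1] * x[test]
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
holdout_r2 = 1.0 - ss_res / ss_tot

# Adjusted analysis: also include z, which only a simulation lets us see.
adjusted = ols([x[train], z[train]], y[train])

print(f"true effect of x on y:    {TRUE_EFFECT:.2f}")
print(f"naive slope on x:         {naive[1]:.2f}")  # population value is 1.0
print(f"holdout R^2 (naive):      {holdout_r2:.2f}")  # population value is 0.4
print(f"slope on x adjusted by z: {adjusted[1]:.2f}")  # population value is 0.0
```

The adjusted regression recovers the true effect only because the simulation hands us the confounder z. With real observational data, z may never have been measured, and nothing in the predictive validation would flag the problem.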
Instead, causal analysis has a toolbelt. While there are no domain-agnostic solutions, there is a plethora of causal analysis methods that are each correct under different conditions. There is also a parallel plethora of software, often open source, available for applying those methods. The difficulty comes in determining which method is appropriate for answering which causal question with which data sets and background knowledge.
What sort of tools are in this toolbelt? Just a few of the more commonly used methods are: instrumental variables, mediation analysis, negative controls, difference-in-differences, propensity score matching, and fitting directed acyclic graphs. There are also many, many other methods that are less commonly used, and the number is growing because this is an active area of methodological research. Software packages are available that enable easy deployment of numerous causal analysis methods, but caution is advised. While these methods can be used correctly, it is also very easy to use them incorrectly. This problem is compounded by the fact that, because causal analysis is unsupervised, validation is difficult: if you apply a method incorrectly or inappropriately, there is no holdout sample AUC or other performance metric that will reveal this mistake. We are forced to rely on other validation methods, which can also vary from one causal analysis to another.
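As one illustration of both the promise and the peril, the sketch below applies the simplest instrumental variable estimator (a ratio of covariances, often called the Wald estimator) to simulated data. All names and coefficients are hypothetical and chosen for the example. When the instrument w affects y only through x, the estimator recovers the true effect despite an unmeasured confounder. When that assumption is quietly violated, the same one-line computation returns a wrong answer with no warning.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
TRUE_EFFECT = 1.5  # causal effect of x on y in this simulation

# Hypothetical data-generating process.
u = rng.normal(size=n)  # unmeasured confounder of x and y
w = rng.normal(size=n)  # candidate instrument: affects x, independent of u
x = w + u + rng.normal(size=n)
y = TRUE_EFFECT * x + 2.0 * u + rng.normal(size=n)


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


# Naive regression slope of y on x: biased by the confounder u.
naive_slope = cov(x, y) / cov(x, x)

# Instrumental variable (Wald) estimate: valid here because w influences
# y only through x.
iv_estimate = cov(w, y) / cov(w, x)

# Misuse: suppose w ALSO affects the outcome directly, which violates the
# exclusion restriction. The formula still runs and still returns a number.
y_bad = y + 1.0 * w
iv_misused = cov(w, y_bad) / cov(w, x)

print(f"true effect:           {TRUE_EFFECT:.2f}")
print(f"naive slope:           {naive_slope:.2f}")
print(f"IV, valid instrument:  {iv_estimate:.2f}")
print(f"IV, invalid instrument:{iv_misused:.2f}")
```

Nothing in the data distinguishes the valid case from the invalid one here; whether w is a legitimate instrument has to be argued from background knowledge about how the data came to be.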
This may seem daunting, but it is not insurmountable. The first step is to identify a concrete causal question that you would like to answer. It’s okay to be ambitious: what do you really, truly want to know? This is often more difficult than it might seem. Once that is identified, what data do you have available that contain the essential measurements required to approach that question at all? Only after both of those questions are answered do you need to start worrying about which of these many analysis methods to try to use. The question and data will often rule out most methods, leaving only a few that need to be studied in enough detail to be deployed correctly and successfully, and validated appropriately.
While using a complicated and heterogeneous toolbelt is certainly less efficient than a silver bullet, it is our only option when it comes to most causal questions. Like a certain popular superhero, what we lack in overwhelming power we can make up with intelligence, creativity, and having just the right tool for the specific problem at hand.
About the author/ODSC East speaker:
Dr. Erich Kummerfeld is a Research Assistant Professor at the Institute for Health Informatics at the University of Minnesota. His primary research interest is in statistical and machine learning methods for discovering causal relationships, with a special focus on discovering causal latent variable models. His work includes (1) developing novel algorithms for discovering causal relationships and latent variables, (2) proving theorems about the properties of causal discovery and latent variable discovery algorithms, (3) performing benchmark simulation studies to evaluate features of the algorithms that are difficult or impossible to evaluate by other means, and (4) applying these novel algorithms to health data in order to inform the development of new treatments.