Stats Can’t Make Modeling Decisions
If the effect sizes of the coefficients are really small, can I interpret that as no relationship? The coefficients are very significant, which is expected with my large dataset. But they are tiny (0.0000001). Can I conclude there is no relationship? Or must I say there is a relationship, but it's not practical?
First, as several people mentioned on Reddit, you have to distinguish between a small coefficient and a small effect size. The size of the coefficient depends on the units it is expressed in. For example, in a previous article I wrote about the relationship between a baby's birth weight and its mother's age ("Are first babies more likely to be light?"). With weights in pounds and ages in years, the estimated coefficient is about 0.017 pounds per year.

At first glance, that looks like a small effect size. But the average birth weight in the U.S. is about 7.3 pounds, and the range from the youngest to the oldest mother was more than 20 years. So if we say the effect size is "about 3 ounces per decade", that would be easier to interpret. Or it might be even better to express the effect in terms of percentages; for example, "A 10-year increase in mother's age is associated with a 2.4% increase in birth weight."
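The unit conversions above are simple enough to check directly. This sketch uses the numbers from the article (0.017 pounds per year, 7.3 pounds average birth weight); the exact percentage depends on the baseline you divide by, so it comes out near, but not exactly, the 2.4% quoted.

```python
# Convert a regression coefficient into more interpretable terms.
# Values are taken from the article.
coef_lb_per_year = 0.017   # pounds of birth weight per year of mother's age
mean_weight_lb = 7.3       # average U.S. birth weight in pounds

oz_per_decade = coef_lb_per_year * 10 * 16   # pounds -> ounces, over 10 years
pct_per_decade = coef_lb_per_year * 10 / mean_weight_lb * 100

print(f"{oz_per_decade:.1f} ounces per decade")      # about 2.7, i.e. roughly 3
print(f"{pct_per_decade:.1f}% increase per decade")  # about 2.3%
```

Same coefficient, three framings: the last two are much easier to evaluate against real-world experience than "0.017 pounds per year".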
So that’s the first part of my answer:
Expressing effect size in practical terms makes it easier to evaluate its importance in practice.
Statistical analysis can inform modeling choices, but it can’t make decisions for you. As a reminder, when you make a model of a real-world scenario, you have to decide what to include and what to leave out. If you include the most important things and leave out less important things, your model will be good enough for most purposes.
But in most scenarios, there is no single uniquely correct model. Rather, there are many possible models that might be good enough, or not, for various purposes.
If you want to argue that an effect SHOULD be included in a model, you can justify that decision (using classical statistics) in two steps:
1) Show that the estimated effect size is big enough to matter in practice.

2) Show that the p-value is small, which at least suggests that the observed effect is unlikely to be due to chance. (Some people will object to this interpretation of p-values, but I explain why I think it is valid in "Hypothesis testing is only mostly useless".)

In my study of birth weight, I argued that mother's age should be included in the model because the effect size was big enough to matter in the real world, and because the p-value was very small.
If you want to argue that it is ok to leave an effect out of a model, you can justify that decision in one of two ways:
1) If you apply a hypothesis test and get a small p-value, you probably can't dismiss the effect as random. But if the estimated effect size is small, you can use background information to make an argument about why it is negligible.
2) If you apply a hypothesis test and get a large p-value, that suggests that the effect you observed could be explained by chance. But that doesn't mean the effect is necessarily negligible. To make that argument, you need to consider the power of the test. One way to do that is to find the smallest hypothetical effect size that would yield a high probability of a significant test. Then you can say something like, "If the effect size were as big as X, this test would have a 90% chance of being statistically significant. The test was not statistically significant, so the effect size is likely to be less than X. And in practical terms, X is negligible."
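Here's one way to compute that X, as a sketch. Assuming a two-sided, two-sample z-test with equal group sizes (my choice for illustration, not something specified in the article), the power for a true effect size d, in standard deviations, has a closed form, and we can search for the smallest d that reaches 90% power.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(d, n):
    """Approximate power of a two-sided two-sample z-test at alpha=0.05,
    with true effect size d (in standard deviations) and n observations
    per group. Ignores the negligible far-tail rejection region."""
    z_crit = 1.96  # critical value for alpha = 0.05, two-sided
    return normal_cdf(d * math.sqrt(n / 2) - z_crit)

def smallest_detectable_effect(n, target_power=0.9):
    """Binary search for the smallest d that reaches the target power."""
    lo, hi = 0.0, 5.0
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        if power_two_sample(mid, n) < target_power:
            lo = mid
        else:
            hi = mid
    return hi

x = smallest_detectable_effect(100)
print(f"{x:.2f}")  # about 0.46 standard deviations with n=100 per group
```

If 0.46 standard deviations is negligible in practical terms for your problem, a non-significant result supports leaving the effect out; if it isn't, the test was too weak to settle the question.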
So far I have been using the logic of classical statistics, which is problematic in many ways.
Alternatively, in a Bayesian framework, the result would be a posterior distribution on the effect size, which you could use to generate an ensemble of models with different effect sizes. To make predictions, you would generate predictive distributions that represent your uncertainty about the effect size. In that case there’s no need to make binary decisions about whether there is, or is not, a relationship.
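As a minimal sketch of that idea, suppose the posterior for the slope is approximately normal around the estimate from the birth weight example. The intercept, standard error, and residual standard deviation below are made-up values for illustration, not numbers from the study. Each predictive draw uses a different slope, so uncertainty about the effect size is propagated instead of being resolved into a yes/no decision.

```python
import random

random.seed(1)

# Hypothetical posterior for the slope (pounds of birth weight per
# year of mother's age): normal around the estimate. The standard
# error, intercept, and residual sd are assumed values.
beta_mean, beta_se = 0.017, 0.004
intercept = 6.8   # assumed, pounds
sigma = 1.2       # assumed residual sd, pounds

def predictive_sample(age, n=10000):
    """Draw from the posterior predictive distribution of birth weight
    for a mother of the given age, averaging over an ensemble of
    models with different slopes."""
    samples = []
    for _ in range(n):
        beta = random.gauss(beta_mean, beta_se)        # posterior draw
        samples.append(intercept + beta * age + random.gauss(0, sigma))
    return samples

weights = predictive_sample(age=30)
mean_weight = sum(weights) / len(weights)
```

Some of the sampled slopes are near zero and some are not; the predictive distribution reflects both possibilities in proportion to their posterior probability.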
Or you could use Bayesian model comparison, but I think that's a mostly misguided effort to shoehorn Bayesian methods into a classical framework. But that's a topic for another time.
Update May 4, 2016: I got a few questions about this article that I thought I should answer here.
Q: Doesn’t N matter too? More likely to find significance in larger samples, making justification that much more important.
A: For the arguments I outlined, we don’t need to know N directly. You are right that if N is very large, an effect might be statistically significant even if it is very small. But then you could apply Negative Argument #1.
If N is very small, an effect might not be statistically significant even if it is substantial. In that case you wouldn’t be able to make a strong argument either way. The affirmative argument would fail because the apparent effect could plausibly be explained by chance. The negative argument would fail because the test was underpowered (specifically, in Negative Argument #2, X would be big).
Q: For the posterior to be useful for decision making, you need to know that the model is causally correct as well, don’t you?
A: Good question! It depends on what kind of decision-making you are talking about.
For example, suppose you find that preschool education predicts future earnings. The effect might be directly causal, or it might be that children who get preschool education have other advantages.
If the task is to predict future earnings, you would probably want to include preschool education in the model, and it would probably help, causal or not.
But suppose you are considering an intervention, like universal preschool education. In that case, you definitely want to know whether the effect is causal. If it isn’t, the intervention might do little or no good.
Update May 5, 2016. In response to another question, I wrote, “For any (non-trivial) real-world scenario, there is no one unique correct model; rather, there are many models that might be good enough (or not) for various purposes. Model choice can be informed by quantitative factors, but there might be several contradictory factors, as well as value judgments.”
Someone asked me to give examples of contradictory criteria and value judgments. Here’s my reply:
Let’s stick with predicting future earnings, and let’s say there are about 10 predictive factors you are considering, like SAT scores, high school grades, parent’s socioeconomic status, etc.
With just 10 factors, there are more than 1000 models to choose from. For each candidate model, you might consider these criteria:
1) How good the predictions are. This one is obvious, but there are several ways to define it, depending on whether you want to minimize absolute error, relative error, mean squared error, or some other cost function.
2) How many factors are in the model. You might prefer a simpler model, but there are different ways you might define “simple”.
3) How early different factors can be measured. If elementary school grades predict almost as well as high school grades, you might prefer elementary school grades because they are available earlier.
4) How easily different factors can be measured. You might prefer a model that runs on cheap data, even if it's not quite as good as a more expensive model.
5) How interpretable the model is. If you are trying to explain something about the factors that contribute to earnings, you might prefer a model that makes sense to people (although you might make a bad choice if you let your preconceptions drive the bus).
6) How causal the model is. If you have background knowledge about which factors are more likely to be causal, you might want to focus on those factors, depending on the purpose of the model.
And I could go on. But I don’t think I’m saying anything truly profound, just that there is no objective, uniquely correct way to navigate tradeoffs like this.
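The model count is just the number of subsets of the candidate factors. Here's a sketch with made-up factor names (the article doesn't list all ten):

```python
from itertools import combinations

# Hypothetical predictive factors; names are illustrative.
factors = ["sat", "hs_gpa", "parent_ses", "preschool", "iq",
           "attendance", "extracurriculars", "region",
           "school_quality", "birth_order"]

# Every non-empty subset of factors is a candidate model.
models = [subset
          for k in range(1, len(factors) + 1)
          for subset in combinations(factors, k)]

print(len(models))  # 1023 non-empty subsets (1024 counting the empty model)
```

With 2^10 candidates and half a dozen partly conflicting criteria, there is no single score to sort them by; that's where the value judgments come in.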
(If you like, this is basically what Kuhn said in “Objectivity, Value Judgment, and Theory Choice“. He talked about theory choice rather than model choice, but I think that’s the same thing.)
Originally posted at allendowney.blogspot.com