Science Lesson: Model Shopping – The Real Problem With Epidemiology
Before data turns into study results, it must be run through a statistical model chosen by the researchers. While it is possible to keep this simple, that is still a choice; the data never speaks for itself. The choices matter, and so “shopping around” among models creates opportunities for the typical minor fudging and for occasional out-and-out lying. Understanding this is useful for anyone who wants to understand the research about vaping — or about nutrition, environmental pollutants, or almost any other health-related research in the news.
For a few types of studies, the best statistical model is fairly obvious. For a randomized experiment, researchers just count up the outcome events in each group. Other bells and whistles can be reported as well, but if that simple statistic is not reported, it becomes fairly obvious that the researchers are attempting to mislead. But for most epidemiology, and for social science more generally, there is no simple obvious choice, and so model choices will typically create misleading results without it being obvious.
For example, consider incorporating age in an epidemiological model, which is usually necessary because age has a large effect on most outcomes. The researchers might put subject age into the statistical calculation as a continuous variable, forcing the model to make particular “assumptions”: for example, that the effect of being 40 rather than 30 is the same as the effect of being 50 rather than 40. Or the researchers can divide subjects into two or more age groups (e.g., 18-25, 26-40, etc.), and then either include a variable for each group or analyze each group separately (called “stratifying”).
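To make those three options concrete, here is a minimal sketch. The ages, bin edges, and group counts are entirely hypothetical; the point is only that the same raw age data can be fed to a model in structurally different ways, each carrying its own assumptions:

```python
import numpy as np

# Hypothetical subject ages (illustration only, not real data).
ages = np.array([19, 24, 31, 38, 47, 55])

# Choice 1: age as a continuous covariate. The model then assumes the
# effect of 40-vs-30 equals the effect of 50-vs-40 (a linearity assumption).
continuous = ages.reshape(-1, 1)

# Choice 2: age as categorical bins (here 18-25, 26-40, 41+), encoded as
# indicator ("dummy") columns, one per group. The bin edges are yet
# another researcher choice.
edges = [18, 26, 41]
group = np.digitize(ages, edges) - 1   # 0, 1, or 2 for each subject
dummies = np.eye(3)[group]             # one-hot indicator matrix

# Choice 3: stratification -- fit a completely separate model within
# each age group.
strata = {g: ages[group == g] for g in range(3)}

print(list(group))          # bin index for each subject: [0, 0, 1, 1, 2, 2]
print(dummies.sum(axis=0))  # subjects per bin: [2. 2. 2.]
```

Each encoding can produce a different estimated effect for the exposure of interest from the same data, which is precisely what makes the choice exploitable.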
The exposure itself can be measured in different ways (e.g., ever tried vaping, vaped at least once recently, vapes every day), as can the outcome. All available data can be analyzed, or just some of it. For most variables, the researchers also have the option of simply leaving them out of the model. There can be hundreds of choices to make, many of which offer literally infinite options. So which model do researchers choose?
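The combinatorial explosion is easy to sketch. The menu of choices below is invented for illustration, but even five modest decisions multiply into hundreds of distinct analyses, any one of which could be the one that gets reported:

```python
from itertools import product

# A hypothetical (and far from exhaustive) menu of analytic choices.
choices = {
    "exposure":   ["ever tried", "used recently", "uses daily"],
    "outcome":    ["diagnosis", "symptom score", "hospitalization"],
    "age coding": ["continuous", "2 bins", "4 bins", "stratified"],
    "sample":     ["all subjects", "adults only", "complete cases"],
    "covariates": ["minimal set", "full set"],
}

# Every combination of one option per choice is a distinct model.
specs = list(product(*choices.values()))
print(len(specs))  # 3 * 3 * 4 * 3 * 2 = 216 distinct models
```

Real studies face far more than five such decisions, and some (e.g., where exactly to place a bin edge) have effectively unlimited options.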
Ideally, researchers would decide on the right model based on theory and previous results, without looking at their data first, and then run it. While this is not completely practical, it is possible to do a reasonable imitation of it. In practice, however, researchers usually run many plausible statistical models and cherry-pick the one that produces the “best” results for their data. They then report that model as if it were the only one they considered, often backfilling a story for why it seemed right from the start. Since the model that produces the strongest associations in the data is usually considered “best,” this means research consistently overstates the associations. It is akin to choosing a photograph from among a few snapshots because it makes you look most attractive. This has obvious advantages, but presenting a valid estimate of your average attractiveness is not one of them.
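A toy simulation shows why “report the strongest result” guarantees overstatement. Everything here is made up: the true effect is exactly zero, and each “model” is just the truth plus model-dependent noise. Yet the best-looking of twenty models reliably appears far stronger than a single model chosen in advance:

```python
import random
import statistics

random.seed(0)

def fitted_effect(true_effect=0.0, noise_sd=1.0):
    """One model's estimated effect: the truth plus model-dependent noise."""
    return true_effect + random.gauss(0.0, noise_sd)

n_sims, n_models = 2000, 20
prespecified, shopped = [], []
for _ in range(n_sims):
    fits = [fitted_effect() for _ in range(n_models)]
    prespecified.append(abs(fits[0]))          # one model picked in advance
    shopped.append(max(abs(f) for f in fits))  # "best" of 20 models tried

# The shopped estimate is systematically inflated even though the
# true effect in every simulated dataset is exactly zero.
print(round(statistics.mean(prespecified), 2))  # honest single-model estimate
print(round(statistics.mean(shopped), 2))       # inflated model-shopped estimate
```

The gap between the two averages is pure selection bias: no individual model is wrong, but the rule for choosing among them is.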
Sometimes the problem is not a mere creep toward overstating relationships. For junk-science researchers like tobacco controllers, the common practice is to search through every possible model option, even those that are not plausible candidates for the best model, to find the one that produces the result that best supports their political goals.
This problem has become much talked about recently, mostly in psychology and other social sciences (the epidemiology field is still largely silent about it). It goes by many names, but the most explanatory is “unreported multiple hypothesis testing.” When I first started studying the problem, when few people were talking about it, I labeled it “publication bias in situ.” Publication bias is often thought of as the “file drawer effect,” in which studies with boring or seemingly wrong results are filed away rather than appearing in journals. This causes the literature as a whole to be biased toward “interesting” and “proper” results. But far more publication bias takes place when a biased choice among many different models is reported. This bias exists in situ — within each study report itself — rather than only at the level of the literature as a whole.
The discussion of this problem often focuses on fiddling with model choices to make a result become “statistically significant.” But this is the wrong way to look at it. Epidemiology is a science of measurement, not a way to discover whether a phenomenon merely exists. Of course it is misleading to claim an exposure causes a statistically significant doubling of a disease risk when it really has no effect, and this may lead to bad choices. But it is equally misleading and harmful to claim it quadruples the risk when it really only doubles it. It is a reasonable rule of thumb to assume that if “the literature” shows that the risk caused by an exposure is X, the real value is much closer to half of X.
This is not just a problem with blatantly dishonest researchers. Almost every researcher I have ever observed in action, including those who are trying to do honest research, does this. They rationalize it based on needing to “listen to the data” or not knowing what model to use until they see which one “best fits the data.” But what this means is that they are usually picking a model that makes the relationships in their data seem as strong as possible.
When blatantly dishonest researchers intentionally take advantage of this, the problem becomes much worse. I recently wrote in detail about a paper by Stanton Glantz from a year ago, in which he claimed that smoking rates among U.S. minors did not decrease as their vaping increased. The authors picked a dataset and made the absurd assumption that smoking prevalence would have followed a particular linear decline after the introduction of vaping, even though historical data clearly showed no such trend. Glantz and other tobacco controllers play lots of games with the truth, but this is probably their worst crime against science: starting with a conclusion they want to support and then hunting for the data and a concocted model that can be used to best support it.
They get away with it because readers never get to see results from unreported models, and can seldom get the data to run other models themselves, so they have no affirmative evidence that the single reported model produced biased results. This is also true for journal reviewers, who see only what the readers see. (When I review a paper for a journal I always insist that the authors report — at least for purposes of doing the review — the results of the other models they ran. It never happens.) Even when something about the reported model seems quite odd, there is little a reader can do beyond noting the oddity.
Sadly, most critics of the Glantz analysis quibbled about details rather than pointing out that the model was an absurd concoction from the start. Because picking favorable models is normal behavior, and often impossible to prove, people in public health and related sciences seldom notice that a model is absurd, even when they are criticizing its result. Moreover, researchers know that admitting this dirty secret of the field — that model choices are biased — could be a threat to their own credibility.