Science Lesson: What Variables Should Be Controlled For?

A previous science lesson explained the concept of confounding and how variables are added to models to try to reduce the effects of confounding and thus sort out how much of an association is causal (the exposure caused the outcome) rather than confounding (e.g., some third factor caused the same people to have both the exposure and outcome). It was observed, but not explained, that “researchers just throw in whatever variables they happen to have, and claim this controls for confounding. This error is common throughout health research. It turns out that this approach is almost as likely to make the confounding worse as it is to reduce it.” But how does including a variable do more harm than good?

The simplest reason is that the variable simply might have no potential to be a “deconfounder.” A deconfounder is a variable whose inclusion seems like it should reduce a posited source of confounding, though it is not necessarily actually a confounder (the cause of confounding). So, for example, cannabis use might be a good (partial!) deconfounder when seeking to measure a causal relationship between teenage vaping and smoking. There is no reason to believe that using cannabis is a confounder, but it is a rough measure of “inclination to try drugs,” clearly a huge unmeasurable confounder. As noted in the previous science lesson, if controlling for this variable makes some of the association go away, it is safe to assume that a better measure for that inclination would make more or all of the association go away.
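To make the "partial deconfounder" idea concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the variable names, the sample size, and the assumption that a single latent "inclination" drives vaping, smoking, and cannabis use equally, with vaping having no effect at all on smoking. Under those assumptions the expected vaping coefficient is 1/2 with no controls, 1/3 when controlling for the noisy cannabis proxy, and 0 when controlling for the (in reality unmeasurable) inclination itself.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)

def slope(x, y, control=None):
    """Least-squares coefficient on x when regressing y on x,
    optionally alongside one control variable."""
    if control is None:
        return cov(x, y) / cov(x, x)
    sxx, scc, sxc = cov(x, x), cov(control, control), cov(x, control)
    return (scc * cov(x, y) - sxc * cov(control, y)) / (sxx * scc - sxc ** 2)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000

# Hypothetical data-generating process (an assumption, not real data).
inclination = [rng.gauss(0, 1) for _ in range(n)]      # latent confounder
vaping = [u + rng.gauss(0, 1) for u in inclination]
smoking = [u + rng.gauss(0, 1) for u in inclination]   # no effect of vaping
cannabis = [u + rng.gauss(0, 1) for u in inclination]  # rough proxy only

crude = slope(vaping, smoking)
with_proxy = slope(vaping, smoking, control=cannabis)
with_true = slope(vaping, smoking, control=inclination)

print("no controls:              ", round(crude, 3))
print("controlling for cannabis: ", round(with_proxy, 3))
print("controlling for the cause:", round(with_true, 3))
```

The proxy removes only part of a purely spurious association, which is why a drop in the estimate after adding a rough proxy suggests that a better measure would remove more.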

But some covariates, like parental smoking, are associated with the outcome while there is no particular reason to believe they affect the association. If they do change the effect estimate, there is no reason to believe they moved it closer to the true value rather than further away. Their effect is functionally random, as is that of most of the hundreds of variables that are available in many datasets. Indeed, if you include a variable that is literally a series of random numbers in a model, it will affect the effect estimate, perhaps making it closer to the true value and perhaps further from it. The same is true for many variables that are habitually included, like a subject’s race or geographic region. This is not a huge problem unless the researchers “model shop” through the large collection of variables and choose the ones whose effects they like. With this, they not only bias their effect estimate, but they also create the illusion, in the minds of naive (i.e., most) readers, that they controlled for confounding.
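The claim about random numbers is easy to check with a toy simulation. The sketch below is an illustration under invented assumptions (a true effect of 1.0, 200 subjects per dataset, 500 simulated datasets). It fits each dataset with and without a covariate of pure random numbers and counts how often the added covariate moved the estimate toward the true value.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)

def slope(x, y, control=None):
    """Least-squares coefficient on x when regressing y on x,
    optionally alongside one control variable."""
    if control is None:
        return cov(x, y) / cov(x, x)
    sxx, scc, sxc = cov(x, x), cov(control, control), cov(x, control)
    return (scc * cov(x, y) - sxc * cov(control, y)) / (sxx * scc - sxc ** 2)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
true_effect = 1.0       # assumed for the simulation
trials, n = 500, 200
closer = further = 0

for _ in range(trials):
    exposure = [rng.gauss(0, 1) for _ in range(n)]
    outcome = [true_effect * e + rng.gauss(0, 1) for e in exposure]
    junk = [rng.gauss(0, 1) for _ in range(n)]  # literally random numbers

    err_before = abs(slope(exposure, outcome) - true_effect)
    err_after = abs(slope(exposure, outcome, control=junk) - true_effect)
    if err_after < err_before:
        closer += 1
    elif err_after > err_before:
        further += 1

print("moved closer to the truth: ", closer, "of", trials)
print("moved further from the truth:", further, "of", trials)
```

The estimate changes in essentially every dataset, and the split between "closer" and "further" should come out near even. A researcher who reports only the junk covariates that pushed the estimate in the preferred direction is therefore choosing the result.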

A statistical model should be based on an underlying model of how the world works. If that underlying model suggests a variable is a deconfounder, that variable should be included in the calculation (even if it does not have much effect), and if it does not suggest a reason why a variable might be a deconfounder, that variable should be excluded (even if including it turns out to have a big effect). If a variable that does not seem like it should matter has a big effect, that might be worth investigating, but it is not a reason to control for it.

Then there are variables that clearly make the effect estimate worse if they are included. The most important of these are measures of intermediate steps on the causal pathway of interest. So, for example, one of the reasons that alcohol consumption protects against cardiovascular disease is that it improves blood lipid profiles. If someone who is intent on disputing the benefits of drinking includes the HDL:LDL ratio as a control variable in their statistical model, they are hiding some of the effect. Drinking improved that lipid ratio, and improving that ratio improved outcomes. “Controlling for” that ratio (i.e., trying to have the statistics ignore its effects) hides the effect you are trying to measure. (I offer that example because it is the one I stumbled across in school, leading me to figure out the points I present here and elsewhere.)
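A small simulation sketch shows how controlling for a step on the causal pathway hides part of the effect. All the numbers are made up for illustration and are not estimates from any study: drinking is assumed to raise the chance of a good lipid profile from 30% to 70%, a good profile is assumed to lower disease risk by 10 percentage points, and drinking is given a further direct benefit of 2 points. Under those assumptions the total effect of drinking is a 6 point reduction in risk, but stratifying on the lipid profile leaves only the 2 point direct effect visible.

```python
import random

def risk_difference(rows):
    """Risk among the exposed minus risk among the unexposed.
    rows is a list of (exposed, outcome) pairs of booleans."""
    exposed = [outcome for is_exposed, outcome in rows if is_exposed]
    unexposed = [outcome for is_exposed, outcome in rows if not is_exposed]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)

rng = random.Random(2)  # fixed seed so the sketch is repeatable
n = 200000
people = []
for _ in range(n):
    drinks = rng.random() < 0.5
    # Intermediate step: drinking improves the lipid profile (assumed sizes).
    good_lipids = rng.random() < (0.7 if drinks else 0.3)
    risk = 0.20 - 0.10 * good_lipids - 0.02 * drinks
    disease = rng.random() < risk
    people.append((drinks, good_lipids, disease))

# Total effect: compare drinkers with non-drinkers, no controls.
total = risk_difference([(d, y) for d, _, y in people])

# "Controlling for" lipids: compare within each lipid stratum, then
# average the stratum-specific differences weighted by stratum size.
adjusted = 0.0
for level in (False, True):
    stratum = [(d, y) for d, lipids, y in people if lipids == level]
    adjusted += len(stratum) / n * risk_difference(stratum)

print("total effect of drinking on risk:  ", round(total, 3))
print("after controlling for lipid status:", round(adjusted, 3))
```

The adjusted number is not wrong as an estimate of the direct effect alone, but presenting it as the effect of drinking understates the benefit by construction.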

Epidemiologists have developed a method of flowchart diagrams to model causal pathways. There are simple theorems for determining which variables in the diagram should be controlled for, to aid those who cannot otherwise figure it out. But tobacco control research is such an intellectual backwater, and attracts so few well-trained epidemiologists, that such methods are ignored.

A good example is the previously reported ongoing dispute in which Brad Rodu and others called out a fatally flawed paper by Stanton Glantz and colleagues. (Details of the fight itself are fascinating, as will be reported in a future article.) The paper claimed to show that teenagers who had tried a cigarette, but were not committed smokers, when first observed (time t1) were more likely to progress to being smokers later (time t2) if they had ever vaped. This is potentially a much more reliable method than the typical one, merely observing the correlation between smoking and vaping at t1 (which is hopelessly confounded by the majority of the population who would never consider doing either). Indeed, if analyzed correctly, as Rodu did, this method shows vaping has no apparent effect.

Rodu pointed out that the original paper failed to control for how much someone had already smoked at t1 (which was also the time their vaping status was assessed). Some of the subjects had tried one puff of a cigarette, while others almost qualified as smokers already. This is an obvious potential confounder when trying to assess any effect of vaping on progressing to smoking at t2, and it turns out to have a huge effect. When this was controlled for, the claimed association went away. It turns out that far more of those who vaped were well on their way to being smokers; this, rather than the vaping, is why they went on to be smokers.
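The pattern described here can be reproduced in a toy simulation. The sketch below does not use the study's data. The three smoking levels at t1 and all the probabilities are invented, and vaping is assumed to have no effect whatsoever on smoking at t2. Because heavier experimenters are assumed to be more likely both to vape and to become smokers, the crude comparison shows a large association, while comparing within levels of prior smoking makes it disappear.

```python
import random

def risk_difference(rows):
    """Risk among the exposed minus risk among the unexposed.
    rows is a list of (exposed, outcome) pairs of booleans."""
    exposed = [outcome for is_exposed, outcome in rows if is_exposed]
    unexposed = [outcome for is_exposed, outcome in rows if not is_exposed]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)

# Hypothetical smoking levels at t1: 0 = one puff, 1 = occasional,
# 2 = nearly a smoker. All probabilities are assumptions.
P_VAPES_AT_T1 = {0: 0.10, 1: 0.30, 2: 0.60}
P_SMOKER_AT_T2 = {0: 0.05, 1: 0.25, 2: 0.60}  # does not depend on vaping

rng = random.Random(3)  # fixed seed so the sketch is repeatable
n = 60000
teens = []
for _ in range(n):
    level = rng.choice((0, 1, 2))
    vapes = rng.random() < P_VAPES_AT_T1[level]
    smoker_t2 = rng.random() < P_SMOKER_AT_T2[level]
    teens.append((level, vapes, smoker_t2))

# Crude comparison: vapers versus non-vapers, ignoring prior smoking.
crude = risk_difference([(v, s) for _, v, s in teens])

# Controlled comparison: within each level of smoking at t1,
# weighted by the size of each level.
adjusted = 0.0
for level in P_VAPES_AT_T1:
    stratum = [(v, s) for lv, v, s in teens if lv == level]
    adjusted += len(stratum) / n * risk_difference(stratum)

print("crude risk difference:   ", round(crude, 3))
print("adjusted risk difference:", round(adjusted, 3))
```

With these made-up inputs the crude difference is expected to be about 0.21 and the adjusted difference about zero.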

The authors of the paper protested that this constituted the error of controlling for a step on the causal pathway. They are clearly wrong. Since both that variable and vaping status were values from t1, it is literally impossible that one caused the other. A cause-effect relationship requires that the effect occur later in time, not simultaneously. Anyone who understands causation — or merely follows the recipes for sketching out causal diagrams — can see this.

Thinking in terms of causal diagrams also illustrates what the authors seemed to be groping for but garbled because they do not really understand what they are doing. When using a causal diagram as an aid to modeling, it is often useful to add earlier values for the measured variables. These will almost certainly be causes of the later values for the same variables, and they might be causes of other variables too. So in the present case, vaping status at an earlier time, t0, might(!) have been a cause of the quantity smoked at t1. It is fairly clear that the authors — having never seriously thought it through — are conflating the variable they had (vaping status at t1) with the latent (i.e., unobserved) variable, vaping status at t0. Perhaps they wished they could have analyzed the effect of vaping status at t0 on smoking at t2, but that was not what they actually did.

Vaping status at t0 could (theoretically!) have caused the quantity smoked at t1, which is indeed an intermediate step on the pathway from t0 to quantity smoked at t2 (which is exactly the reason it needs to be controlled for in the statistical model in the paper, which looks at associations between variables from t1 and t2). Thus, if the calculation had been to estimate how vaping status at t0 affects quantity smoked at t2, then quantity smoked at t1 would indeed be a step on the causal pathway as the original authors claimed. But that was not the model in the paper. What was calculated was “effect of vaping status at t1 on quantity smoked at t2,” so the authors apparently do not understand their own analysis.
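The distinction can be checked mechanically once the causal diagram is written down. The sketch below encodes one possible diagram for this dispute as a dictionary of arrows; the node names, the "inclination" node, and the choice of arrows are assumptions made for illustration, not something taken from either paper. A variable is a step on the causal pathway only if it is downstream of the exposure and upstream of the outcome. This is just that one check, not the full set of rules for choosing control variables.

```python
def descendants(graph, start):
    """All nodes reachable from start by following arrows (cause -> effect)."""
    seen, stack = set(), [start]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def on_causal_path(graph, exposure, outcome, candidate):
    """True if candidate lies on a directed path from exposure to outcome."""
    return (candidate in descendants(graph, exposure)
            and outcome in descendants(graph, candidate))

# Hypothetical diagram: each key causes the nodes in its list.
diagram = {
    "inclination": ["vaping_t0", "vaping_t1", "smoking_t1", "smoking_t2"],
    "vaping_t0": ["vaping_t1", "smoking_t1"],  # the authors' implicit story
    "vaping_t1": ["smoking_t2"],               # the effect actually estimated
    "smoking_t1": ["smoking_t2"],
}

# The model in the paper: exposure measured at t1.
paper_model = on_causal_path(diagram, "vaping_t1", "smoking_t2", "smoking_t1")

# The model the authors seem to have had in mind: exposure at t0.
imagined_model = on_causal_path(diagram, "vaping_t0", "smoking_t2", "smoking_t1")

print("smoking_t1 on the pathway from vaping_t1?", paper_model)
print("smoking_t1 on the pathway from vaping_t0?", imagined_model)
```

No arrow can run from vaping at t1 to smoking at t1, because they are measured at the same time, so the first check is false however the rest of the diagram is drawn. Only for the unobserved t0 exposure does the objection about controlling for an intermediate step apply.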

Moreover, when they make this mistake they are effectively saying “the much higher smoking quantity among vapers at t1 was caused by vaping at t0, so you should not control for it.” But notice that this assumes the conclusion, that vaping causes increases in smoking quantity. The circular “logic” goes: “You cannot control for quantity smoked at t1 because higher quantities were caused by vaping. When you don’t control for it, our results suggest that vaping causes higher smoking quantity. This means that we were right to not control for it.” Um, yeah.

Finally, if you actually analyze the data in the way the authors apparently had in mind, then there is nothing useful at all about this study. It is just the same old worthless cross-sectional study that has been done dozens of times: Measure vaping and smoking status at t1, observe that they are associated, and pretend the obvious confounding does not exist. Also pretend that vaping status was measured at t0 rather than t1 so the causation can only run one direction (i.e., ignore the possibility that smoking more caused vaping). Yes, technically they measured smoking at t2, but — and this is the whole point — since that value is basically just a proxy for smoking at t1, this is a red herring.

Some readers may have gotten a bit lost in all that. Here is a summary of the takeaways: Some variables should be controlled for, but most should not. Most of the others just add noise to the results, but they can be used to bias the results if the researchers pick and choose which to include in order to bolster their preferred result. Some variables are certain to make the estimate less accurate if they are controlled for, particularly if they are steps on the causal pathway of interest. It takes a bit of thinking (though seldom all that much) to figure out which variables fall into which of those categories. Such thinking is standard practice in proper epidemiology, but is basically never done in tobacco control or other “public health” research. In particular, the Glantz cabal does not seem to be able to figure out how to do it (or is pretending not to). Finally, it turns out they accidentally designed a study method that could better assess the effect of vaping on smoking and, when done properly, it provided evidence that there is little or no effect.