Multiple Comparisons

Frequentist methods of inference require experimenters (and reviewers) to guard against multiple comparisons. The problem is as follows: each test is designed to give false positives only at a certain rate, but this means that when many tests are run, false positives become progressively more likely. Indeed, if every null hypothesis is true and the tests are independent, the number of false positives at a critical p-value of 0.05 follows a Binomial(N, 0.05) distribution, where N is the number of comparisons.
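
To make that concrete, here is a minimal sketch (assuming independent tests with every null hypothesis true, written in Python purely for illustration) of how quickly the chance of at least one false positive grows with the number of tests:

```python
# Under the assumptions above, the count of false positives across N tests at
# alpha = 0.05 is Binomial(N, 0.05), so the chance of at least one grows fast.
from scipy.stats import binom

alpha = 0.05
for n_tests in (1, 5, 20, 100):
    # P(at least one false positive) = 1 - P(zero false positives)
    p_any = 1 - binom.pmf(0, n_tests, alpha)
    print(f"{n_tests:>3} tests: P(>=1 false positive) = {p_any:.3f}")
```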

As a result, either a correction is applied or a more appropriate test is used. The simplest correction is the Bonferroni correction, which simply divides the critical value by the number of comparisons.
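
As a rough sketch of what that looks like in practice (the p-values below are made up for illustration), the correction just shrinks the threshold that each individual comparison has to clear:

```python
# Bonferroni correction: compare each p-value to alpha / N instead of alpha.
alpha = 0.05
p_values = [0.003, 0.02, 0.04, 0.30, 0.77]  # hypothetical results
corrected_alpha = alpha / len(p_values)

for p in p_values:
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"p = {p:.3f} -> {verdict} at corrected alpha = {corrected_alpha:.3f}")
```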

The Bonferroni correction is, in my opinion, something of an admission of defeat on the part of the experimenter. There is usually an appropriate generalization for a given test: the ANOVA is a generalized t-test, for example, and the MANOVA is a generalized ANOVA. Devising a new method to deal with a genuinely complex procedure is rarely practical, though. Strictly speaking the Bonferroni isn’t inappropriate (although it is conservative), but when you see it that’s generally an indication that someone looked at their experiment and realized they didn’t have the statistical tools to analyze it.
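
To illustrate the point about generalized tests, here is a small sketch on simulated data: three pairwise t-tests give three chances for a false positive, while a single one-way ANOVA asks one question about all the group means at once.

```python
# Simulated three-group example: pairwise t-tests versus one ANOVA.
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(0)
a, b, c = (rng.normal(loc, 1.0, size=30) for loc in (0.0, 0.0, 0.5))

# Three pairwise comparisons -> three chances for a false positive
for name, (x, y) in {"a vs b": (a, b), "a vs c": (a, c), "b vs c": (b, c)}.items():
    print(name, ttest_ind(x, y).pvalue)

# One generalized test of "do any of the means differ?"
print("ANOVA:", f_oneway(a, b, c).pvalue)
```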

A further danger of the multiple comparisons problem is that mining for p-values (sometimes called p-hacking) is basically impossible to guard against. Even if an experiment reports only a few comparisons, there’s no way of knowing how many were made and discarded before the experimenter chose to report those. This is a serious problem. Evidence of p-hacking has been found in several fields simply by analyzing the distributions of reported p-values and looking for a “bump” just below a popular critical value.
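
As a toy illustration of why that detection strategy works (this is my own simulation, not taken from any of those studies), the sketch below models one common form of p-hacking, adding data and re-testing until significance is reached, and shows how the “successful” p-values pile up just below 0.05:

```python
# Optional stopping on null data: peek after every new observation and stop
# the moment p < 0.05. Only the "successes" get written up.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
reported = []
for _ in range(2000):
    data = list(rng.normal(0.0, 1.0, size=10))   # start with 10 null observations
    p = ttest_1samp(data, 0.0).pvalue
    while p >= 0.05 and len(data) < 50:          # keep collecting and re-testing
        data.append(rng.normal(0.0, 1.0))
        p = ttest_1samp(data, 0.0).pvalue
    if p < 0.05:                                 # selective reporting
        reported.append(p)

# The reported p-values cluster in the bin just below the threshold
hist, edges = np.histogram(reported, bins=np.arange(0.0, 0.051, 0.01))
for lo, hi, count in zip(edges[:-1], edges[1:], hist):
    print(f"[{lo:.2f}, {hi:.2f}): {count}")
```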

Interestingly, there is no correction made for multiple comparisons in Bayesian analysis. To someone familiar with traditional p-value based tests this seems immediately absurd and suspicious: nothing about Bayesianism seems to guard against false positives! Recall, however, that the difference in methods is caused by a difference in core philosophy. Bayesian tests have no discrete “positive” or “negative” results, so there is no such thing as a false positive and it makes no sense to try to prevent one. The knowledge implied by the data is what it is, regardless of how many comparisons are made.

This isn’t to say that Bayesian tests can’t be manipulated, but doing so often means making an explicit change to the model. For example, you could assume a very informative prior, but since you have to define your model at some point, doing so is obvious to the reader. Gelman and Hill (2006) suggest that when constructing a hierarchical model one should feel free to try multiple models. This does open the analysis up to manipulation, and the only safeguard I’m familiar with is to apply model checking techniques to ensure that the model is not an absurd one. From the position of Bayesian philosophy, though, none of the models are wrong and it is the job of the reader to decide whether they are acceptable.
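
To give one concrete example of the kind of model checking I mean, here is a minimal posterior predictive check on a toy Beta-Binomial model (the data and prior are made up for illustration): simulate replicated data from the fitted model and ask whether the observed data would look out of place among them.

```python
# Posterior predictive check for a toy model with one shared success
# probability. If the observed statistic sits far in the tail of the
# replicated statistics, the model is probably an absurd one.
import numpy as np

rng = np.random.default_rng(2)
observed = np.array([3, 5, 4, 12, 2, 4])   # hypothetical successes out of 20 trials each
n_trials = 20

# Conjugate update: Beta(1, 1) prior on a single shared success probability
alpha_post = 1 + observed.sum()
beta_post = 1 + n_trials * len(observed) - observed.sum()

# Draw replicated datasets from the posterior predictive distribution
theta = rng.beta(alpha_post, beta_post, size=4000)
p_rep = np.repeat(theta[:, None], len(observed), axis=1)  # one theta per replicate
replicated = rng.binomial(n_trials, p_rep)                # shape (4000, n_groups)

# Test statistic: the spread across groups (the 12 looks like an outlier)
obs_stat = observed.max() - observed.min()
rep_stat = replicated.max(axis=1) - replicated.min(axis=1)
print("Posterior predictive p-value:", (rep_stat >= obs_stat).mean())
```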

Next week I’ll have plenty to say about the binomial distribution and its extended family.