Problems with p-values

Excerpt:

When we want to know if a statistical test is significant or not, we usually turn to the p-value. In psychological research, almost universally, we want a p-value that’s less than 0.05. If the p-value is smaller than this, then we say that there’s a statistically significant effect. But there’s this interesting historical past to the p-value that makes this particular approach pretty messy.

P-values are no gold standard. The way we use them today means p-values have a probability distribution. Any single one could be an outlier and, given the way publishing works, probably is. It’s the reason for so many ‘too good to be true’ findings: they are.


When we want to know if a statistical test is significant or not, we usually turn to the p-value. In psychological research, almost universally, we want a p-value that’s less than 0.05.¹ If the p-value is smaller than this, then we say that there’s a statistically significant effect.

What this means is that we can never make a claim that there is no effect. We can only ever claim that there is evidence for an effect, or that we don’t know if there is or isn’t an effect. We make it fancy by saying there is “no evidence” for an effect, but this is often very misleading. However, problems aside, this sort of makes sense. We might not care about showing there’s no effect if we’re testing some drug. We only care about whether the drug is effective, not about proving that it isn’t.

But there’s this interesting historical past to the p-value that makes this particular approach pretty messy in another, more fundamental way.

The simple explanation of a p-value is that it’s the probability of getting a test statistic at least as extreme as the one you got if nothing but random chance were at work.

Although used earlier, the p-value was popularised by a bloke named Ronald Fisher. He basically thought we could use the p-value as a way of indicating how strong the evidence was that there’s something happening in a dataset that wasn’t due to random chance. You see, the likelihood that you get a strong correlation by random chance is quite high if you’re looking at just a handful of people, so your p-value is going to indicate that: it’s going to be quite a high number. But if you’re looking at 150 people, then the chance your strong correlation is due to some random sampling error is lower, and the p-value will be lower to reflect that.
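To make the handful-versus-150 point concrete, here is a minimal simulation sketch. It assumes two completely unrelated, normally distributed measurements, and treats a correlation of 0.5 or stronger as “strong”. The function names, the group size of 8 and the 0.5 threshold are all illustrative choices, not anything from Fisher.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def chance_of_strong_correlation(n_people, threshold=0.5, n_sims=2000, seed=1):
    """Share of simulated samples of two UNRELATED variables with |r| >= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0, 1) for _ in range(n_people)]
        ys = [rng.gauss(0, 1) for _ in range(n_people)]
        if abs(pearson_r(xs, ys)) >= threshold:
            hits += 1
    return hits / n_sims

# Hypothetical group sizes: a handful of people versus 150.
print("handful (n=8):", chance_of_strong_correlation(8))
print("n=150:        ", chance_of_strong_correlation(150))
```

The share printed for each group size is a simulated stand-in for the p-value of an observed correlation of 0.5: large for the handful, close to zero for 150 people.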

So, you could look at your p-value and you can say things like “it seems pretty likely that we have an effect in our dataset here”, or “it doesn’t seem likely that this is more than random chance in the dataset we’re looking at”.

This is not how we use the p-value, though. Unfortunately, Fisher’s method doesn’t tell people when we can actually make a decision about whether we have an effect, nor can we generalise outside of our dataset. We can only make comments on the strength of evidence in the sample we’re looking at. It’s a descriptive statistic, like a mean or a standard deviation.

Along came Jerzy Neyman and Egon Pearson with a solution. Many people wanted to be able to make a formal decision about whether to conclude there was some effect happening in a dataset or not. Manufacturers were particularly interested. They wanted a way to use statistical tests for quality assurance purposes: how could they make standardised decisions about whether there was something wrong with their widgets or not, without fussing with probabilities each time, while minimising the chance that their decision was a mistake?

Neyman and Pearson found a neat answer. First, you assume by default some null-hypothesis. Let’s say you’re a manufacturer pumping out widgets. You might assume that your factory machines are working fine, and they’re pumping out quality widgets. You can now create an alternative-hypothesis. You might say, the alternative is the factory machines are broken, and we’re pumping out messed up widgets. With these two hypotheses in mind, you can run a test on your widgets: too many messed up widgets would tell you that you should swap from your default assumption that everything is fine to your alternative, that your machines have broken. In statistical terms, we can either fail to reject the default assumption (the null-hypothesis) or we reject the null in favour of our alternative-hypothesis.

In doing this, there are three possible outcomes:

You correctly keep assuming the machines and widgets are fine when they’re fine, or correctly switch to assuming the factory machinery is broken when it’s broken and pumping out messed up widgets.
You could mistakenly swap from the assumption nothing’s broken to the assumption things are broken even though they aren’t (reject a true null-hypothesis; a “type-I error”).
Or you could mistakenly keep assuming nothing’s broken when something actually is messed up (fail to reject a false null-hypothesis: a “type-II error”).

Now that you know all this, you can set some criteria for how often you’re happy to make these type-I and type-II errors. You can say, “You know, 5% of the time I’m happy to risk making a type-I error where I incorrectly conclude that the widget-making machines are broken when they’re actually working fine.” This is setting your alpha (α) level. The alpha level is your threshold for the probability of rejecting the null hypothesis when it’s actually true. Here, the 5% chance translates to the very commonly used α of 0.05.¹ We’re saying that we’re happy to accept a 5% chance of stopping our production line because of a false alarm (a type-I error). We then compare the p-value to the alpha. A p-value at or above the alpha means the result is still plausible if the machines are fine, so we carry on as we were. A p-value below the alpha means the result would be too surprising if the machines were fine, and so it’s time to shut everything down and see what’s wrong.

So, now, every now and again we run our quality control process: selecting some widgets and seeing what proportion of them are messed up. We ask if the probability of seeing at least this many messed-up widgets in a batch, assuming the machines are fine, falls below our alpha. If so, we reject the null: there’s probably some issue with our machines and we might want to shut production down and have a look. On the other hand, even if our proportion of broken widgets is higher than usual, as long as that probability stays at or above alpha, we won’t reject the null. We reckon it’s probably just random chance causing more broken widgets than usual, not something systemic.
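Here is one way the quality control check could be sketched, using the usual convention that a p-value below alpha leads to rejecting the null. It assumes a one-sided exact binomial test and a made-up “machines are fine” defect rate of 2%. The function names and every number are hypothetical, chosen only for illustration.

```python
from math import comb

def upper_tail_p_value(defects, sample_size, null_rate):
    """P(X >= defects) for X ~ Binomial(sample_size, null_rate).

    This is the p-value: the chance of at least this many messed-up
    widgets if the machines are working fine.
    """
    return sum(
        comb(sample_size, k) * null_rate ** k * (1 - null_rate) ** (sample_size - k)
        for k in range(defects, sample_size + 1)
    )

def should_stop_line(defects, sample_size, null_rate=0.02, alpha=0.05):
    """Return (reject_null, p_value); reject when the p-value is below alpha."""
    p_value = upper_tail_p_value(defects, sample_size, null_rate)
    return p_value < alpha, p_value

# Hypothetical batches of 200 widgets with an assumed 2% normal defect rate.
for defects in (4, 7, 12):
    stop, p = should_stop_line(defects, 200)
    print(defects, "defects -> p =", round(p, 4), "| stop the line?", stop)
```

A batch with about the expected number of duds gives a large p-value and production carries on. A batch with far more duds than expected gives a small one and triggers the shutdown.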

We see the alpha and the p-value a lot in scientific research, but there’s another important part of this process: controlling the risk of a type-II error. In this case, it’s the chance we make the mistake of assuming everything is fine with the machines when they’re actually broken and pumping out messed up widgets. This error rate is called beta (β). The power of the test we’re going to be conducting, which is 1 - β, represents the probability that we correctly detect that the machines are broken when they’re actually broken. You might decide that you want to have 80% power, which means we’ll set our β at 0.20. This means we’re accepting a 20% chance of overlooking the problem when the machines are not performing properly.² To work out the power of our test, we need to know a few things, like what alpha we’ve chosen, the sample size (number of widgets we’re looking at), and some features of the population (all of the widgets, not just the sample we’re looking at).
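As a rough sketch of how alpha, sample size and the population feed into power, the code below works it out for a one-sided exact binomial test. The 2% “fine” defect rate and the 5% and 8% “broken” rates are assumptions made up for the example, as are the function names.

```python
def binomial_pmf(n, p):
    """Return [P(X=0), ..., P(X=n)] for X ~ Binomial(n, p), with 0 < p < 1."""
    probs = [(1 - p) ** n]
    for k in range(n):
        # Step from P(X=k) to P(X=k+1) without computing huge factorials.
        probs.append(probs[-1] * (n - k) / (k + 1) * p / (1 - p))
    return probs

def critical_count(n, null_rate, alpha):
    """Smallest defect count k with P(X >= k | machines fine) <= alpha."""
    pmf = binomial_pmf(n, null_rate)
    tail = 0.0
    for k in range(n, -1, -1):
        tail += pmf[k]          # tail is now P(X >= k)
        if tail > alpha:
            return k + 1        # the previous count was the last one within alpha
    return 0

def power(n, null_rate, broken_rate, alpha=0.05):
    """Chance of rejecting the null when the true defect rate is broken_rate."""
    k_crit = critical_count(n, null_rate, alpha)
    return sum(binomial_pmf(n, broken_rate)[k_crit:])

# Hypothetical numbers: 200 widgets per batch, 2% defects when fine.
for broken_rate in (0.05, 0.08):
    pw = power(200, 0.02, broken_rate)
    print("broken rate", broken_rate, "-> power", round(pw, 3), "beta", round(1 - pw, 3))
```

The thing to notice is that power cannot be computed at all until you commit to a guess about how bad “broken” is. A badly broken machine is much easier to catch than a slightly broken one.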

Very critically, we need to work this out before we do the test, to know how big of a sample size we need to achieve the power we want while also making our desired alpha level meaningful. More on this in a bit.
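A minimal sketch of that planning step: search for the smallest batch size that reaches the power we want. It again assumes a one-sided exact binomial test with made-up defect rates (2% when fine, 5% when broken). The helper names and the step size of 10 are illustrative only.

```python
def binomial_pmf(n, p):
    """Return [P(X=0), ..., P(X=n)] for X ~ Binomial(n, p), with 0 < p < 1."""
    probs = [(1 - p) ** n]
    for k in range(n):
        probs.append(probs[-1] * (n - k) / (k + 1) * p / (1 - p))
    return probs

def power(n, null_rate, broken_rate, alpha=0.05):
    """Power of a one-sided exact binomial test at the given sample size."""
    null_pmf = binomial_pmf(n, null_rate)
    tail, k_crit = 0.0, 0
    for k in range(n, -1, -1):
        tail += null_pmf[k]     # tail is now P(X >= k | machines fine)
        if tail > alpha:
            k_crit = k + 1      # smallest count that still keeps us within alpha
            break
    return sum(binomial_pmf(n, broken_rate)[k_crit:])

def smallest_sample_size(null_rate, broken_rate, alpha=0.05,
                         target_power=0.80, step=10, max_n=2000):
    """First sample size (in multiples of step) reaching the target power, or None."""
    for n in range(step, max_n + 1, step):
        if power(n, null_rate, broken_rate, alpha) >= target_power:
            return n
    return None

n_needed = smallest_sample_size(null_rate=0.02, broken_rate=0.05)
print("widgets to inspect per batch:", n_needed)
```

Run the same search with a smaller gap between the two rates and the required batch size balloons. That is exactly why the guess about the effect size matters so much.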

Now, you’ll notice that in much research, we rarely hear anything about the beta. We only ever seem to specify an alpha and talk about the p-values. The reason is that the type-II error rate completely depends on us knowing what the likely size of an effect is in a population. In the context of a factory floor, we can know very clearly what too many broken widgets looks like. But in lots of scientific research, it’s very hard to know how big effects might be in the population. How many people really do think in a certain way under certain conditions? How well does a drug do across all people in all contexts? So researchers just don’t bother, or instead have a guess at the effect size based on rules of thumb. Then, they go ahead and focus exclusively on the alpha—the criterion that tells us when we might have made a type-I error, that we thought we saw something in our data when it was actually just due to chance.

The problem is that, without an informed beta, this doesn’t really mean anything. The beta tells us about the power: the ability of the test to detect an effect whenever we look at a sample of the population. The alpha only gives us the threshold to hold a test result against. The p-value is like a snapshot of what evidence this particular test provided. Another way to put it is, the beta tells us how likely a non-significant p-value is when the effect is real. If we have greater power, then we’ll have a higher likelihood of producing a significant result (p-value less than the chosen alpha level) when the effect is real (when the null-hypothesis is false). Low power means we’ll have a higher likelihood of finding no evidence (p-value greater than alpha) when the effect is real (the null-hypothesis is false).

The implication here is that, under this model, the p-value itself has a probability distribution. So we want to do lots of tests, and produce lots of p-values, so that we can tell whether we have a significant effect or not, because in our example there’s a 20% chance that we’re overlooking the problem! One in every five p-values could be leading us astray.
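To see the p-value behaving like a random quantity, here is a small simulation sketch. It assumes a one-sided z-test with a known standard deviation of 1, a real effect of 0.5, and 25 observations per study, numbers picked so that the power works out to roughly 80%. None of these choices come from the article; they are only there to make the example run.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def one_sided_p_value(sample, null_mean=0.0, sigma=1.0):
    """p-value of a one-sided z-test (known sigma) that the mean exceeds null_mean."""
    n = len(sample)
    z = (sum(sample) / n - null_mean) / (sigma / n ** 0.5)
    return 1.0 - STD_NORMAL.cdf(z)

def simulate_p_values(true_effect, n_per_study, n_studies=2000, seed=7):
    """Repeat the same study many times and collect the p-value from each run."""
    rng = random.Random(seed)
    return [
        one_sided_p_value([rng.gauss(true_effect, 1.0) for _ in range(n_per_study)])
        for _ in range(n_studies)
    ]

# The effect is real in every single simulated study.
p_values = simulate_p_values(true_effect=0.5, n_per_study=25)
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print("smallest p:", min(p_values), "largest p:", max(p_values))
print("share of studies with p < 0.05:", share_significant)
```

Every study here is looking at the same real effect, yet the p-values range from tiny to large, and the share that come out significant should land near the 80% power. The remainder are the misses.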

On the factory floor, this is no problem. We can do tests on every batch of widgets we produce. The odd misleading p-value isn’t going to throw us off very much. In other kinds of research though, we almost never run the same test more than once because it’s hard to replicate the conditions of our tests. And even when we do, because it’s hard to publish non-significant results, if the replicated tests are non-significant, we only ever see the misleading p-values.
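The publishing filter can be sketched the same way. This hypothetical simulation assumes there is no effect at all, runs 2,000 imaginary studies with a one-sided z-test, and “publishes” only the significant ones. The setup and names are invented for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def run_study(rng, true_effect, n):
    """One study: n draws with standard deviation 1, one-sided z-test against zero."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    z = (sum(sample) / n) * n ** 0.5
    return 1.0 - STD_NORMAL.cdf(z)

def published_record(true_effect, n_studies=2000, n=25, alpha=0.05, seed=11):
    """Return (all p-values, only the p-values that clear the significance filter)."""
    rng = random.Random(seed)
    all_p = [run_study(rng, true_effect, n) for _ in range(n_studies)]
    published = [p for p in all_p if p < alpha]
    return all_p, published

# Assumption for the example: the true effect is exactly zero.
all_p, published = published_record(true_effect=0.0)
print("studies run:", len(all_p))
print("studies published:", len(published))
```

With no effect at all, only around one study in twenty clears the bar. But if those are the only ones that reach a journal, the published record consists entirely of false alarms.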

This is one of the major contributors to the replication crisis, a fairly recent insight across almost all scientific disciplines that many ‘significant’ results are misleading. It’s why we see news articles that tell us one year that ‘red wine is bad for you’ and the next year that it’s good.

You see, the way we’ve come to use the p-value is some kind of hybrid of Fisher’s probability and the Neyman-Pearson formal decision. We’ll include a comparison of the p-value to the alpha to claim existence of an effect, we’ll report the p-value itself, and we’ll sometimes make relative claims about the evidence using these things, like claiming something is “highly significant,” “marginally significant,” or “nearly significant.”

This is fundamentally confused. If we care about alphas, we’re making a binary decision, in the Neyman-Pearson sense. We don’t care about the actual value of p, only whether it falls below our alpha or not. In this case, we want to know about the beta too, to know how likely it is that the p-value would fall below the alpha across a bunch of tests when the effect is real. Then we can make inferences about the broader population. Importantly, this assumes that we’re going to do a bunch of tests. Relying on the results of only one or two runs the risk of leading us astray.

Otherwise, we can use the p-values descriptively as Fisher intended: we can talk about how strong the evidence for an effect is in our current dataset. But we can’t generalise that to the broader population.

In neither case does it make sense to run one test and generalise it to the broader population. So of course, this is exactly what we do. Another case of the scientific ritual.

¹ This common level of 0.05 comes from Fisher himself, who suggested it in his book on the p-value. He tells us that it’s a convenient rule of thumb for researchers in 1925, a time before computers rendered tables and approximations largely obsolete. A clever reader might wonder if the same rule of thumb is still appropriate 100 years later, when the rationale for the rule of thumb isn’t quite so relevant…
² Obviously, setting these levels (alpha and beta) depends on the relative consequence of each type of error. If the cost associated with a type-I error, like shutting down production for no reason, is high, we might choose a smaller alpha. On the other hand, if the cost of a type-II error, like allowing a bunch of defective widgets out on the market and messing up our reputation, is high, we might want greater power (and so a smaller beta).