# Problems with p-values

October 24, 2023


P-values are no gold standard. The way we use them today means p-values have a probability distribution. Any single one could be an outlier and, the way publishing works, probably is. It's the reason for so many 'too good to be true' findings: they are.


**Article Status**: Complete (for now).

When we want to know if a statistical test is significant or not, we usually
turn to the p-value. In psychological research, almost universally, we want a
p-value that’s less than 0.05.^{1} If the p-value is smaller than this, then we say
that there’s a statistically significant
effect.

What this means is that we can never make a claim that
there is *no effect*. We can only ever claim that there is evidence for an
effect, or that we don’t know if there is or isn’t an effect. We make it fancy
by saying there is “no evidence” for an effect, but this is often very
misleading. However, problems
aside, this sort of makes sense. We might not care whether there is no effect if we’re
testing some drug. We only care about showing that the drug is effective, not about proving that it’s
not effective.

But there’s this interesting historical past to the p-value that makes this particular approach pretty messy in another, more fundamental way.

The simple explanation of a p-value is that it’s the *probability* of getting
a test statistic at least as extreme as the one you got if there were really no
effect and only random chance were at work.
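
To make "by random chance" concrete, here is a minimal sketch using a made-up example: 9 heads in 10 flips of a coin we assume is fair. The helper name `binomial_tail` and the numbers are illustrative assumptions, not anything from a particular study. It adds up the probability of every outcome at least as extreme as the one observed.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) when X counts successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical data: 9 heads in 10 flips, assuming the coin is fair (p = 0.5).
# One-sided p-value: the chance of 9 or 10 heads from chance alone.
p_value = binomial_tail(10, 9, 0.5)
print(p_value)  # 11/1024, roughly 0.011
```

The p-value here says nothing about whether the coin *is* fair. It only says how surprising the flips would be if it were.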

Although used earlier, the p-value was popularised by a bloke named Ronald Fisher. He basically thought we could use the p-value as a way of indicating how strong the evidence was that there’s something happening in a dataset that wasn’t due to random chance. You see, the likelihood that you get a strong correlation by random chance is quite high if you’re looking at just a handful of people, so your p-value is going to indicate that: it’s going to be quite a high number. But if you’re looking at 150 people, then the chance your strong correlation is due to some random sampling error is lower, and the p-value will be lower to reflect that.
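
The sample-size point can be sketched with a rough calculation. This is an approximation under stated assumptions: it uses the Fisher z-transformation, which treats the transformed correlation as roughly normal, and the function name `approx_correlation_p_value`, the correlation of 0.5, and the sample sizes are all made up for illustration. With only a handful of people the approximation is crude, but the direction of the effect is the point.

```python
from math import atanh, sqrt
from statistics import NormalDist

def approx_correlation_p_value(r, n):
    """Approximate two-sided p-value for a sample correlation r from n pairs.

    Uses the Fisher z-transformation (a normal approximation; needs n > 3).
    """
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same correlation (r = 0.5) seen in a handful of people versus 150 people.
p_small = approx_correlation_p_value(0.5, 6)
p_large = approx_correlation_p_value(0.5, 150)
print(f"n=6:   p = {p_small:.3f}")
print(f"n=150: p = {p_large:.2e}")
```

The identical correlation gives a large p-value in the tiny sample and a minuscule one in the big sample, which is exactly the behaviour described above.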

So, you could look at your p-value and you can say things like “it seems pretty likely that we have an effect in our dataset here”, or “it doesn’t seem likely that this is more than random chance in the dataset we’re looking at”.

This is *not* how we use the p-value, though. Unfortunately, Fisher’s method
doesn’t tell us when we can actually make a *decision* about whether we
have an effect, nor does it let us generalise outside of our dataset. We can only make
comments on the strength of evidence in the sample we’re looking at. It’s a
descriptive statistic, like a mean or a standard deviation.

Along came Jerzy Neyman and Egon Pearson with a solution. Many people wanted to
be able to make a *formal* decision about whether to conclude there was
some effect happening in a dataset or not. Manufacturers were particularly
interested. They wanted a way to use statistical tests for quality assurance
purposes: how could they make standardised decisions about whether there was something
wrong with their widgets or not, without fussing with probabilities each time, while still
minimising the chance that their decision was a mistake?

Neyman and Pearson found a neat answer. First, you *assume by default* a
null-hypothesis. Let’s say you’re a manufacturer pumping out widgets. You might
assume that your factory machines are working fine, and they’re pumping out
quality widgets. You can now create an alternative-hypothesis. You might say,
the alternative is the factory machines *are* broken, and we’re pumping out
messed up widgets. With these two hypotheses in mind, you can run a test on
your widgets—too many messed up widgets would tell you that you should swap
from your default assumption that everything is fine to your
alternative—your machines have broken. In statistical terms, we can either
*fail to reject* the default assumption—the null-hypothesis—or we *reject*
the null in favour of our alternative-hypothesis.

In doing this, there are three possible outcomes:

- You correctly keep assuming the machines and widgets are fine when they’re fine, or correctly switch to assuming the factory machinery is broken when it’s broken and pumping out messed up widgets.
- You could mistakenly swap from the assumption nothing’s broken to the assumption things *are* broken even though they aren’t (reject a true null-hypothesis: a “type-I error”).
- Or you could mistakenly keep assuming nothing’s broken when something actually *is* messed up (fail to reject a false null-hypothesis: a “type-II error”).

Now that you know all this, you can set some criteria for how often you’re
happy to make these type-I and type-II errors. You can say, “You know, 5% of
the time I’m happy to risk making a type-I error where I incorrectly conclude
that the widget-making machines are broken when they’re actually working fine.”
This is setting your alpha (α) level. The alpha level is your threshold for the
probability of rejecting the null hypothesis when it’s actually true. Here, the
5% chance translates to the very commonly used α of 0.05.^{1} We’re saying that
we’re happy to accept a 5% chance of stopping our production line because of a
false alarm (a type-I error). We use the p-value to decide which side of that
threshold we’re on. A p-value higher than the alpha means our results aren’t
surprising enough to abandon the default assumption, so we carry on. A p-value
lower than the alpha means results like ours would be rare if the machines were
fine, and so it’s time to shut everything down and see what’s wrong.

So, now, every now and again we run our quality control process: selecting
some widgets and seeing what proportion of them are messed up. We ask whether the
*probability* of seeing at least this many messed-up widgets in a batch, assuming
the machines are fine, falls below our alpha. If
so we *reject the null*: there’s probably some issue with our machines and we
might want to shut production down and have a look. On the other hand, even if
our proportion of broken widgets is higher than usual, as long as that
probability stays above alpha, we won’t reject the null. We reckon it’s probably just random
chance causing more broken widgets than usual, not something systemic.
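
The quality-control check can be sketched as a one-sided binomial test. Everything specific here is an assumption for illustration: a 2% defect rate when the machines are fine, batches of 200 widgets, and the helper names `binomial_tail` and `reject_null`. The test asks how likely it is to see at least this many defects if the machines are fine, and rejects the null when that probability is at or below alpha.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) when X counts defects in n widgets with defect rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def reject_null(defects, sample_size, baseline_rate, alpha=0.05):
    """Return (p_value, decision); decision is True when we reject 'machines are fine'."""
    p_value = binomial_tail(sample_size, defects, baseline_rate)
    return p_value, p_value <= alpha

# Hypothetical batches of 200 widgets, assuming a 2% defect rate when all is well.
p_bad, stop_bad = reject_null(defects=9, sample_size=200, baseline_rate=0.02)
p_ok, stop_ok = reject_null(defects=6, sample_size=200, baseline_rate=0.02)
print(f"9 defects: p = {p_bad:.3f}, shut down? {stop_bad}")
print(f"6 defects: p = {p_ok:.3f}, shut down? {stop_ok}")
```

Both batches have more defects than the four we'd expect on average, but only the first is surprising enough to cross the threshold.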

We see the alpha and the p-value a lot in scientific research, but there’s
another important part of this process—controlling the risk of a type-II
error. In this case, it’s the chance we make the mistake of assuming
everything is fine with the machines when they’re actually broken and pumping
out messed up widgets. This error rate is called beta (β). The *power* of the
test we’re going to be conducting, which is 1 - β, represents the probability
that we correctly detect when the machines are broken *when they’re actually
broken*. You might decide that you want to have 80% power, which means we’ll
set our β at 0.20. This means we’re accepting a 20% chance of overlooking the
problem when the machines are not performing properly.^{2} To work out the
power of our test, we need to know a few things, like what alpha we’ve chosen,
the sample size (number of widgets we’re looking at), and some features of the
population (all of the widgets, not just the sample we’re looking at).

Very critically, we need to work this out *before* we do the test, to know how
big of a sample size we need to achieve the power we want while also making our
desired alpha level meaningful. More on this in a bit.
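
Here is a rough sketch of what that planning step might look like for the widget example. It assumes things the factory would have to supply: a 2% defect rate when the machines are fine, a 6% defect rate when they're broken, and a one-sided binomial test. The function names and the cap on the search (`max_n`) are hypothetical choices for illustration. For each candidate sample size it finds the defect count that triggers a shutdown at the chosen alpha, then works out how often a broken machine would reach that count.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) when X counts defects in n widgets with defect rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, baseline_rate, alpha):
    """Smallest defect count whose tail probability under the null is <= alpha."""
    for c in range(n + 2):  # c = n + 1 has tail probability 0, so this always returns
        if binomial_tail(n, c, baseline_rate) <= alpha:
            return c

def power(n, baseline_rate, broken_rate, alpha):
    """Chance of reaching the critical count when the machines really are broken."""
    return binomial_tail(n, critical_count(n, baseline_rate, alpha), broken_rate)

def smallest_sample_size(baseline_rate, broken_rate, alpha=0.05,
                         target_power=0.80, max_n=1000):
    """First sample size that reaches the target power, or None if none does."""
    # max_n is capped at 1000 so the binomial coefficients still fit in a float.
    for n in range(1, max_n + 1):
        if power(n, baseline_rate, broken_rate, alpha) >= target_power:
            return n
    return None

# Assumed rates: 2% defects when fine, 6% when broken.
n_needed = smallest_sample_size(baseline_rate=0.02, broken_rate=0.06)
print(n_needed, power(n_needed, 0.02, 0.06, 0.05))
```

Because defect counts are whole numbers, power doesn't rise perfectly smoothly with sample size, so treat the first size that clears the bar as a starting point rather than a guarantee for every larger size. Notice too that the calculation is impossible without a number for `broken_rate`: you have to say how big the effect is before you can plan for it.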

Now, you’ll notice that in much research, we rarely hear anything
about the beta. We only ever seem to specify an alpha and talk about the
p-values. The reason is that the type-II error rate completely depends on us
knowing what the likely *size* of an effect is in a population. In the context
of a factory floor, we can know very clearly what too many broken widgets looks
like. But in lots of scientific research, it’s very hard to know how big effects
might be in the population. How many people really *do* think in a certain way
under certain conditions? How well does a drug do across all people in all
contexts? So researchers just don’t bother, or instead have a guess at the
effect size based on rules of
thumb. Then, they
go ahead and focus exclusively on the alpha—the criterion that tells us when
we might have made a type-I error, that we thought we saw something in our data
when it was actually just due to chance.

The problem is that, without an informed beta, this doesn’t really
*mean* anything. The beta tells us about the power: the ability of the test to
detect an effect whenever we look at a sample of the population. The alpha
*only tells us about the result of a single test*. The p-value is like a
snapshot of the evidence this particular test provided. Another way to put
it is, the beta tells us how likely we are to get a *non-significant p-value*
even though the effect is real. If we have greater power,
then we’ll have a higher likelihood of producing a significant result (p-value
less than the chosen alpha level) when the effect is real (when the
null-hypothesis is false). Low power means we’ll have a higher likelihood of
finding no evidence (p-value greater than alpha) when the effect is real (the
null-hypothesis is false).

The implication here is that, under this model, the *p-value itself* has a
probability distribution. So we want to do *lots of tests*, and produce *lots
of p-values*, so that we can tell whether we have a significant effect or not,
because in our example, even when the problem is real, there’s a 20% chance that
any one test overlooks it!
One in *every five p-values* could be leading us astray.
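
A small simulation makes the point visible. This is a sketch under assumptions chosen for convenience: a real effect of 0.5 standard deviations, samples of 32, and a two-sided z-test with a known standard deviation of 1, which together give roughly 80% power at an alpha of 0.05. The function name `simulate_p_values` and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def simulate_p_values(effect, n, runs, seed=1):
    """Run `runs` two-sided z-tests (known SD of 1), each on a fresh sample of size n."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(runs):
        sample = [rng.gauss(effect, 1.0) for _ in range(n)]
        z = fmean(sample) * sqrt(n)  # test statistic against a true mean of zero
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

# The effect is real in every single run; only the sample changes.
p_values = simulate_p_values(effect=0.5, n=32, runs=2000)
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"significant: {share_significant:.0%}, missed: {1 - share_significant:.0%}")
```

Every run tests the same real effect, yet around a fifth of the p-values land above 0.05. If you only ever saw one of those, you'd have no way of knowing.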

On the factory floor, this is no problem. We can do tests on every batch of widgets we produce. The odd misleading p-value isn’t going to throw us off very much. In other kinds of research though, we almost never run the same test more than once because it’s hard to replicate the conditions of our tests. And even when we do, because it’s hard to publish non-significant results, if the replicated tests are non-significant, we only ever see the misleading p-values.
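
The publishing filter can be sketched the same way. Here the assumptions are deliberately extreme: there is no effect at all in any study, and only significant results get published. The function name `run_null_studies` and the seed are made up. Because the z-statistic is standard normal when the null is true, the sketch draws it directly rather than simulating each sample.

```python
import random
from statistics import NormalDist

def run_null_studies(runs, alpha=0.05, seed=2):
    """Simulate studies of a non-existent effect; return p-values that pass the filter."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    published = []
    for _ in range(runs):
        z = rng.gauss(0.0, 1.0)  # under the null, the z-statistic is standard normal
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:  # assumed publication rule: significant results only
            published.append(p)
    return published

published = run_null_studies(runs=2000)
share_published = len(published) / 2000
print(f"{len(published)} of 2000 studies published ({share_published:.1%})")
```

Roughly one study in twenty clears the bar, and in this setup every one of them is a false alarm. A reader who only sees the published pile sees nothing but misleading p-values.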

This is one of the major contributors to the replication crisis, a fairly recent insight across almost all scientific disciplines that many ‘significant’ results are misleading. It’s why we see news articles that tell us one year that ‘red wine is bad for you’ and the next year that it’s good.

You see, the way we’ve come to use the p-value is some kind of hybrid of Fisher’s probability and the Neyman-Pearson formal decision. We’ll include a comparison of the p-value to the alpha to claim existence of an effect, we’ll report the p-value itself, and we’ll sometimes make relative claims about the evidence using these things, like claiming something is “highly significant,” “marginally significant,” or “nearly significant.”

This is fundamentally *confusion*. If we care about alphas, we’re making a
binary decision, in the Neyman-Pearson sense. We don’t care about the actual
value of p, only whether it exceeds our alpha or not. In this case, we want to
know about the beta too, to know how likely it is that the p-value would exceed
the alpha across a bunch of tests. Then we can make inferences about the
broader population. Importantly, this assumes that we’re going to *do a bunch
of tests*. Relying on the results of only one or two runs the risk of leading us astray.

Otherwise, we can use the p-values descriptively as Fisher intended: we can talk about how strong the evidence for an effect is in our current dataset. But we can’t generalise that to the broader population.

In neither case does it make sense to run one test and generalise it to the broader population. So of course, this is exactly what we do. Another case of the scientific ritual.

^{1} This common level of 0.05 comes from Fisher himself, who suggested it in his 1925 book *Statistical Methods for Research Workers*. He presents it as a convenient rule of thumb for researchers, at a time before computers rendered tables and approximations largely obsolete. A clever reader might wonder if the same rule of thumb is still appropriate 100 years later, when the rationale for the rule of thumb isn’t quite so relevant…

^{2} Obviously, setting these levels, alpha and beta, depends on the relative consequence of each type of error. If the cost associated with a type-I error, like shutting down production for no reason, is high, we might choose a smaller alpha. On the other hand, if the cost of a type-II error, like allowing a bunch of defective widgets out on the market and messing up our reputation, is high, we might want greater power (and so a smaller beta).
