P values: from suggestion to superstition
John Concato (1,2), John A Hartigan (3)

1. Clinical Epidemiology Research Center, Cooperative Studies Program, Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut, USA
2. Department of Medicine, Yale University School of Medicine, New Haven, Connecticut, USA
3. Department of Statistics, Yale University, New Haven, Connecticut, USA

Correspondence to Dr John Concato, Clinical Epidemiology Research Center (151B), VA Connecticut Healthcare System, 950 Campbell Ave, 151B, West Haven, CT 06516, USA; john.concato{at}va.gov; john.concato{at}yale.edu

Abstract

A threshold probability value of ‘p≤0.05’ is commonly used in clinical investigations to indicate statistical significance. To allow clinicians to better understand evidence generated by research studies, this review defines the p value, summarizes the historical origins of the p value approach to hypothesis testing, describes various applications of p≤0.05 in the context of clinical research and discusses the emergence of p≤5×10⁻⁸ and other values as thresholds for genomic statistical analyses. Corresponding issues include a conceptual approach of evaluating whether data do not conform to a null hypothesis (ie, no exposure–outcome association). Importantly, and in the historical context of when p≤0.05 was first proposed, the 1-in-20 chance of a false-positive inference (ie, falsely concluding the existence of an exposure–outcome association) was offered only as a suggestion. In current usage, however, p≤0.05 is often misunderstood as a rigid threshold, sometimes with a misguided ‘win’ (p≤0.05) or ‘lose’ (p>0.05) approach. Also, in contemporary genomic studies, a threshold of p≤10⁻⁸ has been endorsed as a boundary for statistical significance when analyzing numerous genetic comparisons for each participant. A value of p≤0.05, or other thresholds, should not be employed reflexively to determine whether a clinical research investigation is trustworthy from a scientific perspective. Rather, and in parallel with conceptual issues of validity and generalizability, quantitative results should be interpreted using a combined assessment of strength of association, p values, CIs, and sample size.

  • Biostatistics
  • Clinical Research
  • Research Design

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/

Introduction

Clinicians and biomedical researchers frequently encounter reports containing results of statistical analyses (eg, p=0.052). In particular, generations of practitioners and investigators have learned that a threshold probability value (or, more formally, a tail probability value) of ‘p≤0.05’ is used commonly to define statistical significance. For those unfamiliar with the underlying mathematical principles, however, the exact meaning of such information can be elusive. In addition, the corresponding procedures and practices themselves have been criticized.1–6 This report, in mainly non-mathematical terms, defines the p value, summarizes the historical origins of the p value approach to hypothesis testing, describes various applications of p≤0.05 in the context of clinical research, and discusses the emergence of p≤5×10⁻⁸ and other values as thresholds for genomic statistical analyses.

Definition and implications

Studies of exposure–outcome associations typically include four stages: specifying a research question, designing a study architecture, collecting data, and conducting a statistical analysis to draw inferences from the results. (Of note, we use exposure–outcome instead of cause–effect to avoid implications regarding causality; other characterizations include independent variable(s)-dependent variable). The predominant format for conducting statistical analyses is the frequentist approach, referring to the frequency of the occurrence of outcome events in repeated samples from a source population. Introducing an example that will be referred to later, if a ‘fair’ coin were to be flipped 10 times, every occurrence for the possible number of heads has an expected frequency, including a maximum of 24.6% for five heads and five tails, as well as lower expectations for the other possibilities (75.4% combined).
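The coin-flip frequencies quoted above follow from the binomial distribution. As a minimal sketch (the function name `heads_probability` is our own label, not something from the article), the expected frequencies can be reproduced as follows:

```python
from math import comb

def heads_probability(k, n=10, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Expected frequency of every possible number of heads in 10 fair flips.
distribution = {k: heads_probability(k) for k in range(11)}

for k, prob in distribution.items():
    print(f"{k:2d} heads: {prob:6.1%}")

five_heads = distribution[5]          # 252/1024, about 24.6%
all_other_outcomes = 1 - five_heads   # about 75.4%
```

Five heads is the single most likely outcome, yet it still occurs in fewer than one of every four sets of 10 flips.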

Consider a simple two-variable clinical scenario, with exposure as the independent variable and outcome as the dependent variable. Leaving aside various details and assumptions, the frequentist researcher often examines the results with respect to a null hypothesis of no association between exposure and outcome. In this context, the p value is the probability, calculated under the assumption that the null hypothesis is true, of obtaining by random chance an association at least as strong as the one observed. Importantly, the p value is not the probability that the null hypothesis, of no association, is true. Instead, we intentionally presume the null hypothesis (as a straw man argument) is true, to indirectly assess the plausibility of the data not conforming to it, notwithstanding issues of measurement error or systematic bias (including the concept of ‘confounding’). In more formal terms, the American Statistical Association recently published an editorial6 stating ‘a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value’.
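To make the definition concrete, the sketch below computes a two-sided p value for the coin example under the null hypothesis that the coin is fair. The function name `two_sided_binomial_p`, and the convention of counting every outcome that is no more probable than the observed one as ‘at least as extreme’, are illustrative assumptions rather than a procedure described in the article.

```python
from math import comb

def two_sided_binomial_p(k_observed, n=10, p_null=0.5):
    """Tail probability of a result at least as extreme as k_observed heads.

    'At least as extreme' is taken to mean every outcome whose probability
    under the null hypothesis is no larger than that of the observed outcome
    (one common convention, assumed here for illustration).
    """
    def prob(k):
        return comb(n, k) * p_null**k * (1 - p_null)**(n - k)

    observed = prob(k_observed)
    # A small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(n + 1) if prob(k) <= observed * (1 + 1e-9))

p_nine_heads = two_sided_binomial_p(9)   # outcomes 0, 1, 9 or 10 heads: 22/1024
p_five_heads = two_sided_binomial_p(5)   # every outcome qualifies: 1.0

print(f"9 heads in 10 flips: p = {p_nine_heads:.3f}")
print(f"5 heads in 10 flips: p = {p_five_heads:.3f}")
```

Nine heads in 10 flips gives p of about 0.021. That number is the probability of data this extreme if the coin is fair; it is not the probability that the coin is fair.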

From a practical perspective, a test statistic typically determines the probability of the observed result (or a more extreme result) occurring by random chance, if no association exists. Among various reasons for selecting a particular statistical test, one consideration involves the measurement scales of the variables describing the exposure–outcome association. A selected list of some commonly encountered tests is shown in table 1, including the χ² test for categorical variables (eg, binary–binary comparisons, in a 2×2 table), t-test for a binary-continuous comparison, and log-rank test for time-to-event or ‘survival’ analyses when evaluating unadjusted associations. Logistic regression for binary outcomes and proportional hazards regression for time-to-event analyses are common approaches when evaluating adjusted, or multivariable,7,8 associations. Of note, different tests can be applied in the same situation, as with a Fisher's exact test in lieu of a χ² test, especially when sample sizes are small.

Table 1

Examples of type of variables and selected statistical test(s)

Regardless of which statistical test is used, a ‘good’ result for assessing exposure–outcome associations is a small p value representing a low probability, thereby providing statistical evidence that an exposure–outcome relationship exists. In formal terms, the null hypothesis of no association is rejected. Specifically, p≤0.05 indicates that if no association exists, then the probability of the observed or a stronger association being attributable to chance is no greater than 1-in-20. Conversely, an analysis with p>0.05 is considered not statistically significant; chance is considered a plausible explanation, and the null hypothesis is not rejected (although it is never ‘accepted’, given that an infinitely sized source population is assumed).

Historical origins

Published work on using concepts of probability for comparing data to a scientific hypothesis can be traced back for centuries. In the early 1700s, for example, the physician John Arbuthnot analyzed data on christenings in London during the years 1629–1710 and observed that the number of male births exceeded female births in each of the years studied. He reported9 that if one assumes a balance of male and female births is based on chance, then the probability of observing an excess of males over 82 consecutive years is 0.5⁸²≈2×10⁻²⁵, or less than a one-in-a-septillion (ie, one in a trillion-trillion) chance. As an early example of how statistical significance should not be the sole basis for interpreting results, Arbuthnot included what he called an explanatory note, with the findings linked to an assertion that ‘polygamy is contrary to the law of nature and justice’.9
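Arbuthnot's arithmetic can be checked in a few lines. This is only a sketch of the calculation as described above, assuming (as he did) that each year is independent and that an excess of males or of females is equally likely; the variable names are ours.

```python
from fractions import Fraction

years = 82  # 1629 through 1710, inclusive
# Null hypothesis: in any one year either sex is equally likely to predominate.
p_male_excess_one_year = Fraction(1, 2)

# Probability that males predominate in every one of the 82 years.
p_all_years = p_male_excess_one_year ** years

one_in_a_septillion = Fraction(1, 10**24)
print(f"{float(p_all_years):.2e}")          # about 2.07e-25
print(p_all_years < one_in_a_septillion)    # True
```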

In 1900, and during the initial development of the formal discipline of statistics,10 mathematician Karl Pearson11 described the χ² statistical test, applied to topics including throws of dice and roulette balls at Monte Carlo.12 For example, examining data for n=26,306 dice throws, Pearson compared the observed versus the expected frequencies of 5s or 6s, based on a uniform probability for each face value; the tail probability of 0.000016 indicated that the dice were biased toward the higher values.11 Also in 1900, although not involving tests of statistical significance, work done in the late 1800s by the Austrian monk Gregor Mendel on inheritance patterns in peas was first fully appreciated.13 Mendel had established the genetic principles of segregation and independent assortment, and the renewed interest in Mendel's research later spawned a ‘Mendelian–biometrician’ controversy involving statistical methods14,15 that helped to spur development of the science of genetics.

Whether focused on games of chance, patterns of inheritance or other topics, research on statistical methods flourished in the early 20th century. In particular, the 1925 publication of Statistical Methods for Research Workers16 by the mathematician and biologist R.A. Fisher is considered a landmark event in statistics. This text and its later editions are credited with helping to develop a formal approach to significance testing using probability, or p values.

Threshold values

Importantly, when deciding on what p value threshold should indicate statistical significance, Fisher and other statisticians were not dogmatic. In 1926, as one of Fisher's early statements endorsing a p value of 0.05 as a boundary, he wrote: “…it is convenient [emphasis added] to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.”17 In 1956, Fisher wrote: “[…] no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”18

Despite Fisher's intent, ‘p≤0.05’ is currently a benchmark in many domains of scientific investigation. Thus, clinicians are often taught that p≤0.05 indicates statistical significance, based on the 1-in-20 threshold described earlier. If a clinical research study has a lower (better) p value of 0.001, for example, then the probability of chance alone explaining the findings would be one in a thousand, which is approximately the chance, invoking the previous coin scenario, of getting 10 heads in a row if a ‘fair’ coin is flipped 10 times (calculated as 0.5¹⁰=0.00098, or ≈0.001).

Contemporary usage

The use of p values is now ubiquitous, but at the same time, their application has taken on excessive reverence, as if rituals are being followed in their application. For example, an editorial19 discussing a randomized trial of a therapeutic intervention indicated that ‘the trial failed to meet its goal: the P value for death for any cause was 0.052, which was higher than the pre-specified value of 0.05. All clinical trials are a gamble, and the [investigators] came close to winning but did not win. Thus, the results of the trial are difficult to interpret’. Although the assessment of results was clarified elsewhere in the editorial, these particular statements (in a high-impact journal) prioritized the p value threshold of 0.05. Readers might mistakenly view the original trial20 as a failure—only on the basis of a p value calculated to the third decimal place.

Two scenarios can illustrate why a p value threshold does not represent a win–lose situation. As shown in figure 1, in study A with n=87 participants, the p value is 0.062, not meeting the p≤0.05 threshold for statistical significance. If only two participants are added, however, and if the additional exposed participant has the outcome, whereas the additional non-exposed participant does not, then in study B with n=89 participants, the p value is 0.037, a statistically significant result. (Of note, these results were calculated using a Fisher's exact test).

Figure 1

In the first study ‘A’, with n=87, the relative risk is 2.5 (95% CI 0.99 to 6.5) and the p value is 0.062. In the second study ‘B’, with n=89, the relative risk is 2.7 (95% CI 1.1 to 7.0) and the p value is 0.037. The two studies are quite similar from an overall perspective, but ‘B’ is statistically significant, whereas ‘A’ is not.
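The cell counts behind figure 1 are not reproduced in this text. The sketch below therefore uses counts that are our assumption, chosen because they are consistent with the reported sample sizes, relative risks and p values: in study A, 13 of 44 exposed and 5 of 43 non-exposed participants have the outcome; study B adds one exposed participant with the outcome and one non-exposed participant without it. The function `fisher_exact_two_sided` is a hypothetical helper that sums the probabilities of all tables no more likely than the observed one, which is one common definition of the two-sided test.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table

                   outcome   no outcome
        exposed       a          b
        non-exposed   c          d
    """
    exposed, unexposed = a + b, c + d
    with_outcome = a + c
    total = exposed + unexposed

    def ways(x):
        # Number of tables with x exposed participants having the outcome,
        # holding all row and column totals fixed (exact integers).
        return comb(exposed, x) * comb(unexposed, with_outcome - x)

    smallest = max(0, with_outcome - unexposed)
    largest = min(exposed, with_outcome)
    observed = ways(a)
    as_extreme = sum(ways(x) for x in range(smallest, largest + 1)
                     if ways(x) <= observed)
    return as_extreme / comb(total, with_outcome)

# Assumed counts (see the lead-in); not taken from the original figure.
study_a = (13, 31, 5, 38)   # n = 87
study_b = (14, 31, 5, 39)   # n = 89, two participants added

p_a = fisher_exact_two_sided(*study_a)
p_b = fisher_exact_two_sided(*study_b)
print(f"Study A (n={sum(study_a)}): p = {p_a:.3f}")
print(f"Study B (n={sum(study_b)}): p = {p_b:.3f}")
```

Under these assumed counts, the two p values fall on opposite sides of 0.05 even though the tables differ by only two participants.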

Most readers would agree that the distinction between the two hypothetical studies, involving n=87 or n=89 participants, is modest. The relative risks are similar, showing that both studies suggest a similar level of strength of association. The appropriateness of the study design, the quality of data, or other issues can dominate a modest distinction in calculated p values.

Confidence intervals

Although beyond the scope of this paper, CIs are a more informative counterpart of p values, reporting mathematical stability in the format of the relative risk or other expressions (eg, OR, HR, risk difference) of the strength of association. From a pragmatic perspective, as shown in figure 2, and using relative risk for illustration, a 95% CI that excludes the null value is statistically significant—leading to the same conclusion with regard to statistical significance as a p≤0.05. Conversely, a 95% CI that includes the null value is not statistically significant.

Figure 2

p Values and CIs provide concordant information regarding statistical significance. A 95% CI that excludes the null value of one, as with a p value of ≤0.05, indicates a statistically significant result. A 95% CI that includes the null value of one, as with a p value of >0.05, indicates a statistically non-significant result. A p=0.05 occurs when a CI ends at 1.0 and is considered statistically significant.
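As an illustration of how a 95% CI for a relative risk is obtained and compared with the null value of one, the sketch below uses the common log-transformation (Wald-type) interval. Both the method and the 2×2 counts are assumptions on our part: the article does not state how its intervals were calculated, and the counts are simply ones that are consistent with the relative risks and CIs reported for studies A and B.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def relative_risk_ci(a, b, c, d, confidence=0.95):
    """Relative risk and CI for exposed (a with outcome, b without)
    versus non-exposed (c with outcome, d without)."""
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(RR) for two independent proportions.
    se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * se_log_rr
    return rr, exp(log(rr) - half_width), exp(log(rr) + half_width)

# Assumed counts, not taken from the original figures.
rr_a, lower_a, upper_a = relative_risk_ci(13, 31, 5, 38)   # n = 87
rr_b, lower_b, upper_b = relative_risk_ci(14, 31, 5, 39)   # n = 89

for label, rr, lo, hi in (("A", rr_a, lower_a, upper_a),
                          ("B", rr_b, lower_b, upper_b)):
    verdict = "excludes" if lo > 1 or hi < 1 else "includes"
    print(f"Study {label}: RR = {rr:.1f} (95% CI {lo:.2f} to {hi:.1f}), {verdict} 1")
```

The interval for study A just includes one and the interval for study B just excludes it, mirroring the p values on either side of 0.05.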

Described in the 1930s21 and endorsed later by influential papers22,23 on this topic, CIs are now a welcomed accompaniment of p values, providing information on stability linked to information on the strength of association. Although vulnerable to the same problems as p values regarding inference, CIs can help to interpret analytic results.24 Consider an analysis with a ‘non-significant’ result, such as relative risk=1.4, 95% CI=0.80 to 2.4 and p=0.20. In another project addressing the same question, a stronger point estimate, wider CI and significant p value are determined, such as relative risk=4.1, 95% CI=1.2 to 14.0 and p=0.02. In a side-by-side comparison, the first scenario can actually be viewed24 as providing more trustworthy information on a possible association, given a narrower CI, despite the lack of statistical significance when judged by p≤0.05. The take-home message is that a p value alone does not provide comprehensive information on an analytic result.
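The link between a CI and a p value can be sketched by back-calculating an approximate p value from each reported interval. This assumes a normal approximation on the log scale; because the reported estimates are rounded and the original analytic method is not stated, the recovered values only approximate the quoted p=0.20 and p=0.02. The function name `p_from_ratio_ci` is hypothetical.

```python
from math import erfc, log, sqrt

def p_from_ratio_ci(estimate, lower, upper, z_crit=1.959964):
    """Approximate two-sided p value and SE of ln(ratio) from a 95% CI."""
    # The CI spans 2 * z_crit standard errors on the log scale.
    se = (log(upper) - log(lower)) / (2 * z_crit)
    z = abs(log(estimate)) / se
    # erfc(z / sqrt(2)) equals the two-sided standard normal tail area.
    return erfc(z / sqrt(2)), se

p_first, se_first = p_from_ratio_ci(1.4, 0.80, 2.4)
p_second, se_second = p_from_ratio_ci(4.1, 1.2, 14.0)

print(f"RR 1.4 (0.80 to 2.4): SE of ln(RR) = {se_first:.2f}, p is roughly {p_first:.2f}")
print(f"RR 4.1 (1.2 to 14.0): SE of ln(RR) = {se_second:.2f}, p is roughly {p_second:.2f}")
```

The first result has less than half the standard error of the second, which is the sense in which its narrower CI conveys a more stable estimate despite the larger p value.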

Sample size and statistical power

The general relationship between sample size and statistical significance tends to be appreciated by experienced researchers, but may not be immediately apparent to clinicians. As shown in the top portion of table 2 (using data from a previously prepared example25), a relatively big quantitative difference (one-third vs one-quarter) can be statistically non-significant, due to small sample size, and such results should not be surprising. This situation provides an argument in favor of calculating a priori statistical power, based on the difference in outcomes that might be expected, to avoid ‘underpowered’ studies. Interestingly, the concept of statistical power, as with p values, was developed in the early 20th century,26,27 but its relevance was not appreciated widely by many researchers until decades later, with recognition that numerous underpowered studies had been published in the social science28 and medical literature.29

Table 2

Examples of large and small quantitative differences, and corresponding p values25

In brief, the power of a test is the probability that a statistically significant result will be detected, given the existence of an association of a certain (hypothesized) strength. Most randomized trials are designed to have at least 80% power, indicating at most a 20% chance of concluding that the data are not supportive of an exposure–outcome association if a specified exposure–outcome relationship were to exist (ie, making a false-negative inference). In practical terms, power should be calculated when a study is designed, and is then discussed again (but not recalculated) if the results are not statistically significant.
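An a priori calculation of this kind can be sketched with the standard normal-approximation formula for comparing two proportions. The proportions (one-third vs one-quarter) come from the example in the text; the two-sided threshold of 0.05, the 80% power, equal group sizes and the specific formula are conventional choices assumed here for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Participants needed per group to detect p1 vs p2 (two-sided test)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = standard_normal.inv_cdf(power)            # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n_80 = sample_size_per_group(1 / 3, 1 / 4)
n_90 = sample_size_per_group(1 / 3, 1 / 4, power=0.90)
print(f"80% power: {n_80} per group; 90% power: {n_90} per group")
```

Under these assumptions, several hundred participants per group are needed, which explains why the same difference in a small study can easily be non-significant.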

Clinical significance

Statistical (probabilistic) significance and clinical (quantitative) significance are different concepts. Regardless of arguments in favor or against hypothesis testing, few would argue against the claim that p≤0.05 has become a standard approach. In contrast, no single threshold for clinical significance exists—and none will likely develop—because each clinical context is different. For example, even a small improvement in survival for a new therapeutic agent for an aggressive type of cancer would likely be viewed positively; the same percentage improvement involving a benign and self-limited ailment might be met with less enthusiasm.

From a statistical perspective, the bottom portion of table 2 shows that even a modest quantitative difference (eg, 28.8% vs 28.2%) can be statistically significant with a large sample size. Analyses of information in healthcare (administrative) databases are likely contexts for this scenario to occur. Sometimes characterized as ‘overpowered’ analyses, the results can include clinically unimportant differences that achieve statistical significance. If the implications of p≤0.05 are misunderstood, then associations can be misinterpreted when evaluating research questions involving therapeutic effectiveness, quality improvement, and other topics.
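The dependence of statistical significance on sample size can be sketched with a pooled two-proportion z-test. The proportions are those mentioned in the text; the group sizes are hypothetical values chosen for illustration, because the contents of table 2 are not reproduced here, and equal group sizes are assumed.

```python
from math import erfc, sqrt

def two_proportion_p(p1, p2, n_per_group):
    """Two-sided p value for a difference in proportions
    (pooled z-test, equal group sizes assumed)."""
    pooled = (p1 + p2) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))   # two-sided standard normal tail area

# Large difference, small (hypothetical) sample.
p_large_difference = two_proportion_p(1 / 3, 1 / 4, n_per_group=60)

# Small difference, increasingly large (hypothetical) samples.
small_difference = {n: two_proportion_p(0.288, 0.282, n)
                    for n in (1_000, 10_000, 100_000)}

print(f"one-third vs one-quarter, n=60 per group: p = {p_large_difference:.2f}")
for n, p in small_difference.items():
    print(f"28.8% vs 28.2%, n={n:,} per group: p = {p:.3f}")
```

The 0.6 percentage point difference stays the same throughout; only the sample size changes, and with it the verdict on statistical significance.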

Interestingly, an argument was made in the early 20th century that statistical significance does not confer quantitative significance,30 yet for decades many studies only reported p values. Other studies used ‘*’ to indicate p≤0.05 and ‘NS’ to indicate non-significance. Reporting p values without a relative risk, OR, etc, to describe the strength of association is inappropriate, and corresponding uncertainty in estimates should be provided (eg, using CIs). In addition, using symbols is less informative than reporting actual p values, in that the actual values quantify the probabilistic evidence against the null hypothesis. More generally, and almost 100 years later, a statement published in 1919 still applies: “[…] statistical ability, divorced from a scientific intimacy with the fundamental observations, leads nowhere.”30

p Values in the genomic era

The threshold of p≤0.05 was established when medical investigations tended to involve a modest number of measurements per participant. In contrast, genome-wide association studies can evaluate hundreds of thousands, to several million, single nucleotide polymorphisms (SNPs) as the exposure variable, and a disease or trait as the outcome variable, for each participant. Studies involving whole-exome or whole-genome sequencing have even larger numbers to consider.

If a p value of ≤0.05 were to be used as the threshold for statistical significance in these situations, numerous associations would be expected by chance alone. Accordingly, and using various strategies for calculations, thresholds such as 5×10⁻⁸ have been proposed31,32 to distinguish ‘chance’ from potentially ‘real’ genomic associations. This particular approach is related to pregenomic concepts of multiple comparisons.33–36 For example, the Bonferroni correction35 calculates a threshold p value for each comparison as 0.05/N, where N is the number of comparisons. Thus, when evaluating 10 associations in a clinical study, p≤0.005 for any comparison indicates statistical significance. Using this strategy for a genome-wide association study involving 1 million SNPs, p≤5×10⁻⁸ would be the calculated threshold for each polymorphism. The false discovery rate37 is another strategy used for this purpose.
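The Bonferroni arithmetic is simple enough to sketch directly. The function names are ours, and the count of chance findings assumes, for illustration only, that every null hypothesis is true.

```python
def bonferroni_threshold(n_comparisons, overall_alpha=0.05):
    """Per-comparison p value threshold under the Bonferroni correction."""
    return overall_alpha / n_comparisons

def expected_chance_findings(n_comparisons, per_test_alpha=0.05):
    """Expected number of 'significant' results if no association is real."""
    return n_comparisons * per_test_alpha

for n in (1, 10, 1_000_000):
    print(f"{n:>9,} comparisons: threshold p <= {bonferroni_threshold(n):.0e}, "
          f"about {expected_chance_findings(n):,.0f} chance findings at p <= 0.05")
```

With 1 million SNPs, an uncorrected threshold of 0.05 would be expected to flag about 50,000 associations by chance alone, whereas 0.05 divided by 1 million gives the familiar 5×10⁻⁸.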

Comments and caveats

As shown in table 2, a p value of ≤0.05 for a given strength of association can be achieved by enlarging sample size. Even for a fixed sample size, however, a calculated p value is not a unique assessment of any given data set. For example, using the data in figure 1 and choosing a χ² test instead of a Fisher's exact test, p=0.04 for n=87 and p=0.02 for n=89, suggesting that both associations are statistically significant, just by choosing a different statistical test that analyzes the data via another conceptual approach and a different mathematical algorithm. As another example, and although other reasons for inconsistent results exist (including confounding), adding or removing several variables (ie, covariates) from a multivariable regression model can affect the p value, and possibly change the statistical significance, of a primary variable of interest.38 Clinicians and investigators should certainly not assume that p≤0.05 implies a ‘true’ association, even if the non-statistical aspects of a study are conducted and reported impeccably.
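The effect of switching tests can be sketched with a Pearson chi-square calculation for a 2×2 table. Two assumptions are ours: the cell counts (13/44 exposed vs 5/43 non-exposed with the outcome for n=87, and 14/45 vs 5/44 for n=89), which are not given in the text but are consistent with the reported results, and the absence of a continuity correction, since the article does not say which variant was used.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p value (1 degree of freedom,
    no continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper tail of chi-square at chi2 equals
    # the two-sided standard normal tail at sqrt(chi2).
    return chi2, erfc(sqrt(chi2 / 2))

# Assumed counts, not taken from the original figure.
chi2_a, p_a = chi_square_2x2(13, 31, 5, 38)   # n = 87
chi2_b, p_b = chi_square_2x2(14, 31, 5, 39)   # n = 89

print(f"n=87: chi-square = {chi2_a:.2f}, p = {p_a:.2f}")
print(f"n=89: chi-square = {chi2_b:.2f}, p = {p_b:.2f}")
```

With these counts both p values fall below 0.05, whereas Fisher's exact test on the same n=87 table was reported as p=0.062.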

To emphasize the arbitrary aspect of p≤0.05, the field of physics commonly uses a threshold p value of 3×10⁻⁷ for statistical significance, based on observations at least 5 SDs from the null hypothesis. Although the concept of SD is not discussed in this review, p≤0.05 corresponds to ∼2 SDs, a less restrictive threshold to achieve. Other topics include issues such as one-tail versus two-tail significance testing. In brief, ‘tail’ or directionality refers to whether the hypothesis allows for a drug, for example, to have either a beneficial or a harmful effect, or is just expected to show benefit. Of note, a one-tail p value of 0.025 corresponds to a two-tail p value of 0.05, but two-tail (bidirectional) testing is endorsed in most situations.39 As a separate conceptual issue, this narrative focuses on p values in settings where independence of compared groups, such as treatment and control arms in a trial, is the desired outcome. In some situations, including the Monte Carlo example mentioned earlier,11 as well as for Mendelian genetics,13 similarity to a specific pattern is expected. (In addition, p>0.05 can even be a desired result, as with analyses to show that the observed data conform to a predicted model).
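The correspondence between SDs and p values, and between one-tail and two-tail tests, can be sketched under the assumption of a normally distributed test statistic. Treating the 5 SD physics threshold as a one-tail probability is also our assumption; it is the reading under which the value rounds to 3×10⁻⁷.

```python
from math import erfc, sqrt

def one_tail_p(z):
    """Probability of a standard normal value at least z SDs above the mean."""
    return 0.5 * erfc(z / sqrt(2))

def two_tail_p(z):
    """Probability of a value at least |z| SDs from the mean in either direction."""
    return 2 * one_tail_p(abs(z))

print(f"5 SDs, one tail:    {one_tail_p(5):.2e}")     # about 2.87e-07
print(f"2 SDs, two tails:   {two_tail_p(2):.3f}")
print(f"1.96 SDs, one tail: {one_tail_p(1.96):.3f}")
print(f"1.96 SDs, two tails: {two_tail_p(1.96):.3f}")
```

A two-tail p value is exactly twice the one-tail value at the same number of SDs, so a one-tail p of 0.025 and a two-tail p of 0.05 describe the same distance from the null hypothesis.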

Finally, problems with the frequentist approach have been well documented, and Bayesian strategies are considered to be an appealing alternative approach.40–45 Consistent with the explanation provided in the Definition and Implications section, the frequentist conceptual approach can be stated as: ‘What is the probability that the observed data are inconsistent with the null hypothesis?’ In contrast, the Bayesian conceptual approach for a typical research project can be stated as: ‘Given the observed data, what is the probability that the true effect is negative (null)?’. Although the Bayesian flow of logic is more in keeping with how clinicians think, the need to specify the strength of association ahead of time—in formal terms, the ‘prior probability distribution’ of the effect size—seems to have made many researchers reluctant to adopt Bayesian methods.

Broad perspective

This review does not discuss all issues related to p values, and the topics that are included are presented only as an overview, or in illustrative terms. For example, several distinctions that exist46–49 between recommendations for significance testing by Ronald Fisher (focused mainly on a null hypothesis) and by Jerzy Neyman and Egon Pearson (incorporating an alternative hypothesis using power calculations) are not described. In addition, at least one health-related journal (Basic and Applied Social Psychology) has recently banned the ‘null hypothesis significance testing procedure (NHSTP)’,50 stating specifically that prior to publication, ‘authors will have to remove all vestiges of the NHSTP (p values [and] statements about “significant” differences or lack thereof, and so on)’.50

Notwithstanding such theoretical underpinnings and conceptual debates, authors of contemporary research articles should, at a minimum, avoid performing a perfunctory social ritual51 involving p values. This scenario involves thoughtlessly repeating the same action (significance testing procedures), focusing on a special number (p≤0.05), fearing sanctions (by reviewers or editors) for rule violations, and thinking wishfully (seeking, and sometimes manipulating, a desired p value) while limiting critical judgment (as reflected by superficial discussion of statistical results, such as the win–lose dichotomy described in the Contemporary usage section).

As a general guideline, authors should report—and thoughtfully interpret—results describing associations in terms of strength, such as relative risk, as well as stability, including the magnitude of p values and CIs. The size of the study population is also relevant. From a more general perspective, information on the stability of results is important, but the clinical relevance of a research report is also affected by the issues of validity, confirming that the results are correct for the participants involved, and generalizability, describing to whom the results apply.52 Box 1 provides several take-home points in this context.

Box 1

Take-home points for using probability values in clinical research

  • Assuming no association exists, a test statistic determines a p value for (ie, the tail probability of) an observed result, or a more extreme result, occurring by random chance.

  • The threshold p value of ≤0.05 for statistical significance, promoted in the early 20th century only as an informal suggestion, indicates a 1-in-20 chance of a false-positive inference (ie, assuming an exposure–outcome association when it does not exist).

  • Even if a study is conducted impeccably and reported accurately, clinicians and investigators should not assume that p≤0.05 implies a ‘true’ association—and comparing a p value to a threshold does not represent a win–lose situation.

  • In genomic studies, p value thresholds such as 5×10⁻⁸ reflect the extremely large number of associations (eg, alleles) being evaluated for each participant.

  • In addition to p values, or CIs (as another format for expressing stability of results), a numerical result for the strength of an association (eg, relative risk) is essential information.

  • Rigorous statistical analyses should be combined with relevant clinical insight regarding the corresponding research question, data collection, and study design.

  • While considering the conceptual issues of validity and generalizability, interpreting the numerical results of clinical research investigations should assess the strength of association, magnitude of p values, CIs, and sample size.

Conclusion

Statistical testing describes the stability of quantitative results from a probabilistic perspective, but tests of significance should not be viewed as an all-or-none approach, and p values should rarely be the main focus of attention or the primary basis for evaluating a research study. At the very least, p≤0.05 or p≤5×10⁻⁸ should not be employed reflexively to determine whether a study is trustworthy from a scientific perspective. Along with the conceptual issues of validity and generalizability, the strength of association, magnitude of p values, width of CIs, and size of the study sample are all relevant when interpreting the results of clinical research investigations.

In 1880, the biologist T.H. Huxley stated “…it is the customary fate of new truths to begin as heresies and end as superstitions.”53 Almost a century after originally being adopted, p value thresholds have evolved into a superstition. To improve medical research and ultimately clinical care, more judgment, and less ritual, is warranted.

Acknowledgments

The authors thank Peter Peduzzi for helpful comments and Sandra Augustitus for assistance in preparing the manuscript.

Footnotes

  • Contributors JC drafted the article and JAH contributed important intellectual content; both authors approved the final version.

  • Funding JC is supported by the VA Cooperative Studies Program.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.
