The New Gastroenterologist

The P value: What to make of it? A simple guide for the uninitiated


 

How (not) to interpret the P value

Many clinicians interpret the P value in isolation, without considering other relevant factors, and assume that the dichotomization of results as “significant” and “nonsignificant” accurately reflects reality.3

Authors may say something like the following: “Treatment A was effective in 50% of patients, and treatment B was effective in 20% of patients, but there was no difference between them (P = .059).” They declare this “no difference” because P = .059 does not meet the threshold for a “statistically significant difference.” However, this does not mean that there is no difference.

First, if the conventional cutoff for significance were another arbitrary value, say .06, then this would have been a statistically significant finding.

Second, we should pay attention to the magnitude of the P value when interpreting results. As defined above, the P value is the probability of observing a difference at least as large as the one found when there is truly no difference between treatments, that is, the probability of a false-positive result. This probability can exceed 5% to varying degrees. For example, a false-positive probability of 80% (P = .80) is very different from one of 6% (P = .059), even though, technically, both are “nonsignificant.” A P value of .059 can be interpreted to mean that there is possibly some “signal” of a real difference in the data. The study above may simply have been underpowered to detect the 30-percentage-point difference between treatments as statistically significant; had the sample size been larger, and the power therefore greater, the finding could have been significant. Instead of reporting that there is no difference, it would be better to say that these results are suggestive of a difference but that the study lacked the power to confirm it. Alternatively, P = .059 can be described as “marginally nonsignificant” to distinguish it qualitatively from much larger values, say P = .80, which are clearly nonsignificant.
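To make the role of sample size concrete, here is a minimal sketch in Python, assuming hypothetical arm sizes and a simple pooled two-proportion z-test (a rough normal approximation; the original study’s actual design and numbers are not given here):

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided P value from a pooled two-proportion z-test (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same 30-percentage-point difference (50% vs 20% response), tested at
# two hypothetical sample sizes per arm:
print(two_proportion_p(5, 10, 2, 10))      # n = 10 per arm:  P ~ .16, "nonsignificant"
print(two_proportion_p(50, 100, 20, 100))  # n = 100 per arm: P < .001, "significant"
```

The effect size is identical in both calls; only the amount of information changes, which is exactly why a “nonsignificant” P value from a small study is weak evidence of no difference.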

Third, a key distinction is that between clinical and statistical significance. In the example above, even though the result was not statistically significant (P = .059), a difference of 30 percentage points seems clinically important. The distinction can perhaps be better illustrated with the opposite, and more common, mistake. As mentioned, a large sample size increases power, and thus the ability to detect even minor differences. For example, if a study enrolls 100,000 participants in each arm, then even a difference as small as 0.2 percentage points between treatments A and B can be statistically significant. However, such a difference is clinically irrelevant. Thus, when researchers report “statistically significant” results, careful attention must be paid to the clinical significance of those results. The purpose of a study is to uncover reality, not to satisfy technical conventions.
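A similarly minimal sketch, again using the pooled two-proportion z-test and assuming hypothetical event rates of 5.0% and 5.2% (the baseline rate matters, so treat these numbers as purely illustrative):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical mega-trial: 100,000 participants per arm, with assumed event
# rates of 5.0% vs 5.2% (an absolute difference of 0.2 percentage points).
n = 100_000
x1, x2 = 5_000, 5_200
pooled = (x1 + x2) / (2 * n)
se = sqrt(pooled * (1 - pooled) * (2 / n))
z = (x1 / n - x2 / n) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(p_value)  # about .04: statistically significant, yet the absolute benefit
                # is only 0.2 percentage points (number needed to treat ~ 500)
```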

Multiple testing and P value

Finally, another almost universally ignored problem in clinical research papers is that of multiple testing. It is not uncommon to read papers in which the authors present results for 20 different and independent hypothesis tests and, when one of them has a P value less than .05, declare it a significant finding. However, this is clearly mistaken: the more tests that are performed, the higher the probability of at least one false positive. Imagine having 20 balls, only one of which is red. If you pick a random ball only once, you have a 5% probability of picking the red one. If, however, you try 10 different times (replacing the ball after each draw), the probability of picking the red ball at least once is much higher (approximately 40%). Similarly, if we perform only one test, the probability of a false positive is 5%; if we perform many tests, the probability of at least one false positive is considerably higher than 5% (with 20 independent tests, about 64%).
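The inflation is easy to quantify; a short sketch, assuming each of k independent tests is carried out at the conventional .05 level:

```python
# Probability of at least one false positive across k independent tests,
# each performed at the .05 significance level: 1 - 0.95**k.
for k in (1, 10, 20):
    print(f"{k:>2} tests: {1 - 0.95 ** k:.0%}")
#  1 tests: 5%
# 10 tests: 40%  (the red-ball analogy above)
# 20 tests: 64%
```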
