Statisticians want to abandon science’s standard measure of significance

In science, the success of an experiment is often determined by a measure called “statistical significance.” A result is considered to be “significant” if the difference observed in the experiment between groups (of people, plants, animals and so on) would be very unlikely if no difference actually exists. The common cutoff for “very unlikely” is that you’d see a difference as big or bigger only 5 percent of the time if it wasn’t really there, a cutoff that might seem, at first blush, very strict.

It sounds esoteric, but statistical significance has been used to draw a bright line between experimental success and failure. Achieving an experimental result with statistical significance often determines if a scientist’s paper gets published or if further research gets funded. That makes the measure far too important in deciding research priorities, statisticians say, and so it’s time to throw it in the trash.

More than 800 statisticians and scientists are calling for an end to judging studies by statistical significance in a March 20 comment published in Nature. An accompanying March 20 special issue of the American Statistician makes the manifesto crystal clear in its introduction, which says of “statistically significant”: “don’t say it and don’t use it.”

There is good reason to want to scrap statistical significance. But with so much research now built around the concept, it’s unclear how, or with what other measures, the scientific community could replace it. The American Statistician offers a full 43 articles exploring what scientific life might look like without this measure in the mix.

This isn’t the first call for an end to statistical significance, and it probably won’t be the last. “This is not easy,” says Nicole Lazar, a statistician at the University of Georgia in Athens and a guest editor of the American Statistician special issue. “If it were easy, we’d be there already.”

What does statistical significance offer?

Many scientific studies today are designed around a framework of “null hypothesis significance testing.” In this type of test, a scientist runs an experiment asking, say, whether a drug reduces depression in a treated group versus a control group. The scientist then compares the results against the hypothesis that no difference really exists between the groups. The goal is not to prove that the drug fights depression. Instead, the idea is to gather enough data (eventually) to reject the hypothesis that it doesn’t.

The scientist will compare the groups using a statistical analysis that results in a P value, a number between 0 and 1, with the “P” standing for probability. The value signifies the likelihood that repeating the experiment would yield a difference as big as (or bigger than) the one the scientist got if the drug doesn’t actually reduce depression. Smaller P values mean that the scientist is less likely to see a difference that large if no difference really exists. In scientific parlance, the result is “statistically significant” if P is less than or equal to 0.05.
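One way to make that definition concrete is a permutation test, sketched below in Python. This is only an illustration, not the analysis any particular study used: the function name, the add-one correction and the depression scores are all hypothetical. The idea is that if the drug does nothing, the group labels are interchangeable, so shuffling the labels shows how often chance alone produces a difference as big as the observed one.

```python
import random


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Estimate a two-sided P value for a difference in group means.

    Under the null hypothesis the group labels carry no information, so we
    reshuffle them many times and count how often the shuffled difference
    is at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(treated) / len(treated) - sum(control) / len(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_treated = pooled[:n_treated]
        shuffled_control = pooled[n_treated:]
        diff = abs(
            sum(shuffled_treated) / n_treated
            - sum(shuffled_control) / len(shuffled_control)
        )
        if diff >= observed:
            hits += 1
    # Add-one correction: the estimate can never be exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical depression scores (lower is better), invented for illustration.
treated_scores = [10, 11, 12, 9, 10, 11, 10, 12]
control_scores = [20, 21, 19, 22, 20, 21, 20, 19]
print(permutation_p_value(treated_scores, control_scores))
```

With groups this far apart, almost no reshuffling reproduces the observed gap, so the estimated P value is small. Two identical groups would give a P value of 1.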

When scientists interpret P values correctly, they can be useful for finding out how compatible experimental results are with the scientists’ expectations, Lazar says. Because a P value is a probability, it “has variability attached to it,” she explains. “If I repeated my procedure over and over, I’d get a whole range of P values. Some would be significant, some wouldn’t.”
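A small simulation can illustrate the spread Lazar describes. The sketch below is built on assumptions that are not from the article: a true effect of half a standard deviation, 30 subjects per group, normally distributed scores and a normal-approximation test. It repeats the same hypothetical experiment 1,000 times and collects the P values.

```python
import math
import random
import statistics


def approx_p_value(a, b):
    """Two-sided P value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def repeat_experiment(n_repeats=1000, n_per_group=30, true_effect=0.5, seed=1):
    """Rerun the same simulated experiment and return every P value."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_repeats):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        p_values.append(approx_p_value(treated, control))
    return p_values


p_values = repeat_experiment()
share_significant = sum(p <= 0.05 for p in p_values) / len(p_values)
print(f"smallest P: {min(p_values):.4f}, largest P: {max(p_values):.4f}")
print(f"share with P <= 0.05: {share_significant:.2f}")
```

Even though the simulated drug always has the same real effect, some runs land below 0.05 and others well above it.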

Because of this variability, P equal to 0.05 was never meant to be an end result. Instead, it was more of a beginning, “something that would cause you to raise your eyebrows and investigate further,” Lazar says.

Where did the idea for statistical significance come from?

Many scientists now interpret P equal to 0.05 as a cutoff between an experiment that “worked” and one that didn’t. That cutoff can be attributed to one man: famed 20th-century statistician Ronald Fisher. In a 1925 monograph, Fisher offered a simple test that research scientists could use to produce a P value. And he offered the cutoff of P equals 0.05, saying “it is convenient to take this point as a limit in judging whether a deviation [a difference between groups] is to be considered significant or not.”

That “convenient” suggestion has reverberated far beyond what Fisher probably intended. In 2015, more than 96 percent of papers in the PubMed database of biomedical and life science papers boasted results with P less than or equal to 0.05.

What’s the problem with statistical significance?

But science and statistics have never been so simple as to cater to convenient cutoffs. A P value, no matter how small, is just a probability. It doesn’t mean an experiment worked. And it doesn’t tell you if the difference in results between experimental groups is big or small. In fact, it doesn’t even say whether the difference is meaningful.
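The gap between a small P value and a big effect can be shown with a little arithmetic. The sketch below uses made-up summary numbers and a simple normal-approximation test (both are assumptions for illustration) to compare a trivial difference measured in a huge study with a large difference measured in a tiny one.

```python
import math


def p_from_summary(mean_diff, sd, n_per_group):
    """Two-sided P value from summary statistics (normal approximation).

    Assumes two equally sized groups that share the same standard deviation.
    """
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical studies: the numbers are invented for illustration.
tiny_effect_big_study = p_from_summary(mean_diff=0.05, sd=1.0, n_per_group=20_000)
large_effect_small_study = p_from_summary(mean_diff=0.8, sd=1.0, n_per_group=8)

print(f"tiny effect, 20,000 per group: P = {tiny_effect_big_study:.2g}")
print(f"large effect, 8 per group:     P = {large_effect_small_study:.2g}")
```

Under these assumptions the trivial difference clears the 0.05 bar easily while the large one does not, because the P value reflects sample size as much as the size of the effect.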

The 0.05 cutoff has become shorthand for scientific quality, says Blake McShane, one of the authors on the Nature commentary and a statistician at Northwestern University in Evanston, Ill. “First you show me your P less than 0.05, and then I will go and think about the data quality and study design,” he says. “But you better have that [P less than 0.05] first.”

That shorthand also draws a bright line between scientific findings that are “good” and those that are “bad,” when in fact no such line exists. “On one side of the threshold, you label it one thing, and if it falls on the other side, it’s something else,” McShane says. But nothing in statistics, or reality, actually works that way. Strictly speaking, he says, “there’s no difference between a P value of 0.049 and a P value of 0.051.”

What would it take to get rid of statistical significance?

Because statistical significance is entrenched in science culture, being used widely in decisions on whether to fund, promote or publish scientific research, a switch to anything else would take huge effort, says Steven Goodman, a Stanford University medical research methodologist who contributed one of the 43 articles of the special issue of the American Statistician. “The currency in that economy is the P value,” he says.

Computer programs that calculate a P value automatically from experimental data have helped to make the measure even more of a “crutch,” Goodman notes. Using it as the default means that scientists “haven’t developed the scientific muscles to understand what it means to reason under true uncertainty.” True uncertainty doesn’t mean scientists throw up their hands and say the data don’t reveal anything. In statistics, “uncertainty” refers to how much data are expected to vary from one experiment to another. Learning to interpret that uncertainty in scientific results, he notes, would require a lot more statistical training than many scientists usually get.

Shifting to one or many new kinds of statistics that better capture uncertainty would also mean that scientists would have to put more effort into making judgment calls. Journal editors and peer reviewers would have to learn to rely on other criteria to determine if a study was worth publishing. Scientific journals might have to change their standards. “It’s very, very hard to dislodge,” Goodman says. “The world of science is not ruled or directed by statisticians.”

Partly because of the potential challenges of change, some scientists don’t want to throw out statistical significance cutoffs just yet. Some want to start by raising the bar. Instead of P less than or equal to 0.05 as a cutoff, Valen Johnson, a statistician at Texas A&M University in College Station, prefers P less than or equal to 0.005, which corresponds to a 0.5 percent chance that someone would observe a difference as big as or bigger than the one observed if the null hypothesis were true. “It’s not quite an absolute thre