Statistical Significance

Suppose there were only one person in group R and one in group X. A positive experimental result could then easily occur by accident. Even if radishes had no effect at all, there would still be a reasonable chance that the person in group R would have fewer cavities; maybe the person in group X just had a bad checkup. If there are ten people in each group and one group has substantially more people with cavities than the other, we can be much more confident that the radishes made the difference. While it's common for one person to have more cavities than another by chance, we would not expect chance to put almost all the people with few cavities in the same group.

But if 4 people in group R had cavities, as compared with 6 in group X, would it be safe to assume that the radishes had an important effect? Probably not. Even if eating radishes were irrelevant, we wouldn't be surprised to find one group ahead of the other by two. On the other hand, if there were two people with cavities in group R and nine in group X, it would seem pretty unlikely that this could have occurred by chance.
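To put rough numbers on these two cases, here is a minimal sketch in Python. It assumes ten people per group and asks: if the people with cavities had been dealt into the two groups purely at random, how often would group R get this few of them? This is a one-sided version of Fisher's exact test, chosen here only for illustration. The function name and the choice of test are assumptions, not something the experiment itself specifies.

```python
from math import comb

def chance_of_split(cavities_r, cavities_x, group_size=10):
    """Probability that group R gets cavities_r or fewer of the cavity cases
    when the cases are assigned to two equal groups at random
    (one-sided Fisher exact test; an illustrative choice of test)."""
    total_cases = cavities_r + cavities_x
    total_people = 2 * group_size
    # Number of equally likely ways to pick who lands in group R.
    ways_to_form_group = comb(total_people, group_size)
    favorable = sum(
        # k cavity cases in group R, the rest of group R cavity-free.
        comb(total_cases, k) * comb(total_people - total_cases, group_size - k)
        for k in range(0, cavities_r + 1)
    )
    return favorable / ways_to_form_group

print(chance_of_split(4, 6))  # roughly 0.33
print(chance_of_split(2, 9))  # roughly 0.003
```

Under these assumptions, a 4-to-6 split turns up by chance about one time in three, while a 2-to-9 split turns up less than one time in three hundred.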

When experiments are properly designed, it is typical to specify a statistical test on the results and to count the experiment as a success only if, assuming the treatment had no effect, there would be less than a 5% (or sometimes 1%) chance of getting a result at least this strong by chance. A result that meets this standard is said to be "statistically significant". It is important to note that a result that is not statistically significant does not show that the radishes had no beneficial effect. It may be that the effect was just not large enough to stand out given the small number of people who were tested.

There is an additional rule that must be followed if a claim of statistical significance is to be valid. The exact test to be performed must be chosen before the data are gathered. Otherwise the experimenter could consider lots of different tests and only report the one that worked the best. Suppose our experimenter didn't just measure tooth decay, but also gum disease, colds, allergies, skin rashes, weight gain, hair loss, and academic performance. If each test had a 5% chance of succeeding by accident, there are now eight different "accidents" that could produce a positive result, so the chance of at least one accidental success is much greater than 5%. Statistical procedures can take into account the number of ways an experiment can succeed, but they aren't valid if the experimenter decides what she is measuring only after she knows what would work.
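The arithmetic behind this can be sketched in a few lines of Python. The sketch assumes the eight measurements (tooth decay plus the seven others) are independent of one another, which real health outcomes usually are not, so treat the figure as a rough guide. The function names are made up for illustration, and the Bonferroni adjustment shown is just one common way of accounting for multiple tests.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests
    'succeeds' by accident when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_threshold(num_tests, alpha=0.05):
    """Stricter per-test threshold that keeps the overall chance of an
    accidental success at or below alpha (Bonferroni adjustment)."""
    return alpha / num_tests

num_outcomes = 8  # tooth decay plus the seven other measurements
print(chance_of_any_false_positive(num_outcomes))  # roughly 0.34
print(bonferroni_threshold(num_outcomes))          # 0.00625
```

With eight independent chances at 5% each, an accidental "success" somewhere is about a one-in-three proposition. Requiring each individual test to clear the stricter threshold brings the overall chance back under 5%.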

Occasionally tests of ESP will come out with a surprisingly low score: say the subject guesses only 6 right out of 100 when they would average 20 by chance. Sometimes researchers claim such a result shows there is an ESP effect, except that it worked in reverse in this case. But if there is a 5% chance of accidental success for the normal result and a 5% chance of accidental success for the reverse result, then counting either direction as a success gives a 10% chance of a supposedly significant result even if the effect doesn't really exist.
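A short Python sketch can make the doubling concrete. It assumes 100 independent guesses, each with a 1-in-5 chance of being right, which matches the average of 20 mentioned above. The helper names are hypothetical. It computes how unusual a score of 6 or lower is, then finds the "surprisingly low" and "surprisingly high" cutoffs that each carry at most a 5% chance, and adds the two chances together.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k correct guesses out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def lower_tail(k, n, p):
    """Probability of k or fewer correct guesses."""
    return sum(binomial_pmf(i, n, p) for i in range(0, k + 1))

def upper_tail(k, n, p):
    """Probability of k or more correct guesses."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

n, p = 100, 0.2  # assumed: 100 guesses, 1-in-5 chance each

chance_of_6_or_fewer = lower_tail(6, n, p)

# Most generous cutoffs whose tails are each still at most 5%.
low_cutoff = max(k for k in range(n + 1) if lower_tail(k, n, p) <= 0.05)
high_cutoff = min(k for k in range(n + 1) if upper_tail(k, n, p) <= 0.05)

one_direction = upper_tail(high_cutoff, n, p)
either_direction = lower_tail(low_cutoff, n, p) + one_direction

print(chance_of_6_or_fewer)
print(low_cutoff, high_cutoff)
print(one_direction, either_direction)
```

Because scores come in whole numbers, each tail lands somewhat under 5% rather than exactly on it, but the point stands: accepting a result in either direction adds the two chances together, so the combined chance of an accidental "success" is well above what a one-direction test would allow.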