## 25 September 2014

### Don't understand statistics? Wanna bet? Guess the number of 'heads' out of 10 coin tosses, and you can keep this.
I think we sometimes belabour statistical significance testing. The whole process is actually very intuitive.

Let me take you through a thought-experiment to show you this.
"I'm going to toss a coin 10 times. And to make it interesting, let me put this pineapple (an Australian \$50 note) on the table and make you this offer: if you guess the exact number of heads in the next ten coin tosses that I make, I will give you the \$50. If you don't guess it exactly, I get to keep my \$50. Are you willing?"
So assuming that you see that participating in this gamble is a "no-brainer", you say:
"Yes, sure."
So: how many of the 10 coin tosses will be heads? Make your guess now.

**H is for Heads**

Most people guess somewhere around five (5/10). This is, of course, the "null hypothesis", but more on that in a moment.

I now proceed to toss the coin: I catch it, I look, I call out the result, and repeat. Here's what happened during the sequence of 10 coin tosses that I called in my most recent demonstration: at about seven or eight heads, I noticed that people started to laugh. This is very important, because this laughter reflects exactly the logic and the intuition underlying statistical significance testing.

Okay, so let's break it down.

**H is for human intuition**

1. The number of heads observed is what we could call a test statistic. Most test statistics have a name, often a letter (e.g., F, r, t, χ² (chi-squared), z, etc.). Let's just call this test statistic the H-statistic. In the above sequence of 10 coin tosses, H = 9.

We know that this H-statistic can vary from 0 through to 10.

We also know that the expected result is about 50% (H=5) with results away from that being less and less probable as we move to the extremes.

Accordingly, the distribution of the H-Statistic under the "null hypothesis" is known and is illustrated in the chart provided above.
2. The observed result was 9 heads out of 10 coin tosses (H=9).

Based on our understanding of the distribution at #1 above, we can all agree this result (9/10) is possible, but pretty improbable assuming the coin, the tosses and the calls are fair (which is our expectation under the null hypothesis).

In fact, we can calculate the probability precisely from the chart above. The probability of exactly 9 heads is .01. The probability of 9 or 10 heads is .01 + .001 = .011.

Of course, the punters stood to lose the \$50 if there were too many heads or too few heads in the sequence of coin tosses. The probability of 9 or 10 tails (i.e., 1 or 0 heads respectively) is also .011.

So, the probability of an extreme value of the H-statistic (two-tailed test) is .011 + .011 = .022.
3. The key question here is what to make of this "improbable" result (p=.022 or less for instance)?

The laughter at seven or eight heads attests that many think that getting 9 heads out of 10 seems just a little too flukey.

In other words, when presented with H = 9, people are saying, "Sure, the result is possible, but I'm going to call 'bullsh!t'", or, in a statistician's language, "statistical significance." We reject the idea that the result occurred by chance, and conclude that something else was going on.
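The tail probabilities quoted above are easy to verify yourself. Here is a minimal sketch in Python (standard library only; "H-statistic" is just this post's playful name for the head count):

```python
from math import comb

n = 10   # number of coin tosses
p = 0.5  # probability of heads on each toss under the null hypothesis

def prob_heads(k):
    """Probability of exactly k heads in n fair tosses (binomial distribution)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The expected value, H = 5, is the most probable single outcome
print(round(prob_heads(5), 3))  # 0.246

# One tail: 9 or 10 heads
upper = prob_heads(9) + prob_heads(10)
print(round(upper, 3))  # 0.011

# Both tails: 9+ heads or 9+ tails (i.e., 0 or 1 heads)
two_tailed = upper + prob_heads(0) + prob_heads(1)
print(round(two_tailed, 3))  # 0.021 (the .022 in the text comes from rounding each tail to .011 first)
```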
So every one of us has a naive statistician inside of us as reflected in steps 1-3 above. If you want a (slightly) more technical version, you can keep going, but in some senses, my work here is done!

**H is the hard(er) version!**

Step 1: Every test statistic (F, r, t, χ², z, etc.) is simply a measure of an observed result.

In the example above, the observed result was 9 out of 10, H-statistic = 9.

Step 2: The observed result in this instance is then compared with the known distribution of that statistic.

The distribution of the H-statistic is shown in the chart (for 10 coin tosses), but you could look it up in a table (binomial distribution) or use an online calculator. (The probability of success (i.e., "Heads") on a single trial is .5, the number of trials was 10, the observed result was 9.)

The probability of 9 or 10 heads was .011, and the probability of 1 or 0 heads was .011; the combined probability of 9 or more heads or 9 or more tails is .022.

This p-value of .022 is a measure of how far what we observed was from what would have been expected.

As it turns out, the statistician's significance level is probably a little stricter than the human's significance level (as based on the laughter measure). The probability of 9 or 10 heads or tails (for the two-tailed test) is .022. The probability of 8 or more heads or tails is .11 (calculated the same way, using 8 as the observed result instead of 9).

In short, in this case, the naive human statistician will call "bullsh!t" at 8 heads or tails out of 10, but a statistician wouldn't call "statistical significance" until observing 9 out of 10.
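The two thresholds can be compared directly. A quick check in Python (the function name `two_tailed_p` is my own, not a standard library call):

```python
from math import comb

def two_tailed_p(h, n=10):
    """Two-tailed p-value for observing h or more heads out of n fair tosses."""
    upper = sum(comb(n, k) for k in range(h, n + 1)) / 2**n
    return 2 * upper  # the fair-coin distribution is symmetric, so double one tail

print(round(two_tailed_p(8), 2))  # 0.11 -> laughter, but not significant at alpha = .05
print(round(two_tailed_p(9), 3))  # 0.021 -> statistically significant at alpha = .05
```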

Step 3: We then compare this p-value with a significance (or alpha) level that we have set earlier. The significance level is a criterion marking our credulity level. If what we observe in the test statistic is sufficiently improbable, we call the observed result "statistically significant."

The significance level or alpha is determined by the researcher before beginning the research. In many fields, the default alpha level is p<.05.

As an aside, statistical packages like SPSS label p-values as "Sig." values, which is guaranteed to contribute to confusion. Significance levels are strictly in place before we even start collecting data; p-values are calculated at Step 2 and compared with the significance level at Step 3.

Step 4: We now interpret the result, especially where the p-value is lower than the preset significance level.

In the H-statistic example, the laughter was a reflection that the p-value for the observed result was lower than the (intuitively expressed) significance level. It's a human way of saying you just crossed my credulity threshold: "I expected something around 5:5, could accept a range around that, but this result is simply too improbable and sounding suspicious!"

There are two distinct interpretations that can be made when the observed p-value is less than the significance level (p < .05).

One is to say that the result was surprising: a coincidence, chance, etc. Surprising things happen, and this is just one of those infrequent, low-probability events. In fact, we even know its probability: p < .05, a less than 1-in-20 event. The alternative? Plan B:

The other interpretation is to say the observed result is not due to chance; rather, it reflects something else going on.

This is called "rejecting the null hypothesis" or sometimes "accepting the alternative hypothesis."

Importantly, even if we do accept the alternative hypothesis (this was not a chance occurrence), we may have competing alternative explanations.

In our case of an H-statistic of 9 out of 10, various explanations might be proffered: "The coin has two heads," "You tossed the coin a particular way," some have even had the audacity to claim that I was lying. Outrageous!

These are all alternative hypotheses. We may have rejected the null hypothesis, but we do not know which alternative is correct!

**H is for happy?**

So you see, statistical testing is not that difficult.

1. We start by calculating a statistic, which is simply a measure of the observed result.
2. Somewhere out there, some mathematical statisticians will have done the arcane calculations to produce a distribution of that statistic, showing probabilities for each value from the most probable (the expected value) out to the more distant and improbable values. To get the probability of our observed result, we simply look at the area under the curve for the statistic value 'H' or more extreme.
3. We compare the observed p-value with our significance level, generally set at p = .05.
4. If p is less than the significance level, we dismiss our expectation as the explanation and start searching for some alternative explanation for the observed result.
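The four steps can even be sketched as one small function. This is a toy illustration in standard-library Python (a real analysis would use a statistics package, and `coin_toss_test` is a made-up name for this post's example):

```python
from math import comb

def coin_toss_test(heads, n=10, alpha=0.05):
    # Step 1: the test statistic is simply the observed number of heads.
    # Fold onto the upper tail -- the fair-coin distribution is symmetric.
    h = max(heads, n - heads)
    # Step 2: p-value from the known distribution (binomial), two-tailed
    # because too many tails is as suspicious as too many heads.
    tail = sum(comb(n, k) for k in range(h, n + 1)) / 2**n
    p_value = min(1.0, 2 * tail)
    # Step 3: compare the p-value with the preset significance level.
    significant = p_value < alpha
    # Step 4: interpretation is left to the researcher; we just report.
    return p_value, significant

p, sig = coin_toss_test(9)
print(f"p = {p:.3f}, reject the null: {sig}")  # p = 0.021, reject the null: True
```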

Comments? Errors? (Yep, statistics is still sufficiently tangled that this is very possible!) Let me know your thoughts.

#### 1 comment:

1. Wow, great post.