09 August 2017

Data bites: confusing cross-tabulations

Some recent research from a sample of 957 members of PureProfile's Australian panel showed that people who classified themselves as "Early Birds" were two times more likely than "Night Owls" to earn over $70k per annum.

Specifically, 23% of Early Birds earned over $70k p.a. vs just 11% of Night Owls.

Does that mean that 23% of those who earn $70k+ are Early Birds and 11% are Night Owls?

Nope. If that were true, it would leave two-thirds (66%) of $70k+ earners who are neither Early Birds nor Night Owls.

Does the result mean that there are more Early Birds than Night Owls earning above $70k per annum?

Not necessarily.

If the Night Owls are far more numerous than Early Birds in the total sample, then it is quite feasible for there to be more Night Owls who earn $70k+ even while Early Birds are two times more likely to earn $70k+ than Night Owls.
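To make that concrete, here is a minimal sketch in Python. The 23% and 11% rates come from the research above, but the group sizes (200 Early Birds, 600 Night Owls) are made up purely for illustration and are not the actual PureProfile numbers.

```python
# Hypothetical group sizes (NOT from the PureProfile sample), chosen so that
# Night Owls are far more numerous than Early Birds.
group_sizes = {"Early Birds": 200, "Night Owls": 600}

# Rates reported in the article: share of each group earning over $70k p.a.
rate_over_70k = {"Early Birds": 0.23, "Night Owls": 0.11}

# Number of $70k+ earners in each group = group size x rate.
high_earners = {
    group: round(group_sizes[group] * rate_over_70k[group])
    for group in group_sizes
}

for group, count in high_earners.items():
    print(f"{group}: {count} of {group_sizes[group]} earn over $70k")
```

With these invented group sizes, 66 Night Owls earn $70k+ against 46 Early Birds, even though an individual Early Bird is still about twice as likely to be a high earner.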

This error is very easy to make - unfortunately - and some situations are downright confusing. Here's an example that can seem particularly confounding.

PureProfile's research showed that in the Australian population, men are more likely to be Early Birds than women. About 56% of men are Early Birds compared with just 45% of women (see yellow shading in table below).

However, when we turn the result around so it expresses the proportion of Early Birds (and Night Owls) who are male vs female, we may be surprised to see that 50% of Early Birds are women and 50% are men. (In actual fact, there are slightly more women who are Early Birds than men as we will see in a moment).

Whaaaaaaaaaaaaaat? How can that be?

The problem is one that often confronts us when we do crosstabulations. A crosstabulation (often shortened to crosstab) is simply breaking down the frequency of responses on one variable by groups (in this instance, the groups are male and female).

People tend to get confused because they see the first result (56% of males are Early Birds), and think that this is equivalent to saying that 56% of Early Birds are males.

But this simply ain't so.

Let's break this example out. First, here are the raw counts in each cell. In this sample, there are 945 males who are Early Birds - or 945 Early Birds who are male if you prefer. It is the same thing!

And note that there are slightly more women who are Early Birds than men: 951 women vs 945 men.

The proportion (or per cent) of males who are Early Birds depends on the total number of males there are in the column.

The proportion of Early Birds who are male depends on the total number of Early Birds there are in the row.

So, in a nutshell, there are 945 males who are Early Birds. This represents 56% of the total number of males (column %), but just a fraction under 50% of the total number of Early Birds (row %).
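If you want to check the arithmetic yourself, the sketch below computes both percentages from the same cell. The Early Bird counts (945 men, 951 women) are the ones quoted above. The Night Owl counts are an assumption: they are back-calculated so that the column percentages come out at roughly 56% and 45%, and they are not the figures from the original table.

```python
# Crosstab as a nested dict: rows are chronotype, columns are sex.
counts = {
    "Early Bird": {"Male": 945, "Female": 951},   # from the article
    "Night Owl": {"Male": 743, "Female": 1162},   # ASSUMED for illustration
}

def column_pct(table, row, col):
    """Cell count as a percentage of its column total (base = the column)."""
    col_total = sum(cells[col] for cells in table.values())
    return 100 * table[row][col] / col_total

def row_pct(table, row, col):
    """Cell count as a percentage of its row total (base = the row)."""
    row_total = sum(table[row].values())
    return 100 * table[row][col] / row_total

male_eb_col = column_pct(counts, "Early Bird", "Male")  # % of males who are Early Birds
male_eb_row = row_pct(counts, "Early Bird", "Male")     # % of Early Birds who are male

print(f"Column %: {male_eb_col:.1f}% of males are Early Birds")
print(f"Row %:    {male_eb_row:.1f}% of Early Birds are male")
```

The numerator is 945 in both cases. Only the base changes: all males for the column per cent, all Early Birds for the row per cent.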

The key takeout is this. Whenever a percentage is being reported, take note of the base. Are you looking at the % of the column (in which case the sum of the column is 100%) or the % of the row (in which case the sum of the row is 100%)?

This distinction is important - and it is misunderstood surprisingly often. Here's one extreme example to highlight the problem.

Nearly 100% of sexual assaults are perpetrated by males but that does not mean that all males (or even a high percentage of males) are molesters/rapists - thankfully.

However, that doesn't stop many parents, airline policies and even national news anchors from treating all men as potential molesters. Most molesters are male, but most men are not molesters. Again, thankfully.

Drawing this conclusion, and worse, enacting policy based on this result reflects a gross misunderstanding and misinterpretation of the statistics. And it happens to lead to inappropriate stereotyping of a lot of good men. If interested, you can read more about this case here.

How to minimise the danger of this error?

Whenever reporting a percentage, be very clear about what the base is, i.e. x% of what? Quite simply, % of men is not the same as % of Early Birds.

Meghan Trainor - it's all about the base!

If you're preparing crosstabulations (crosstabs), I generally recommend (and generally present myself) column percentages only. That way, you know you're always comparing the % of column 1 to the % of column 2.

But what goes into the column and what goes into the row? Generally, we try and put the Causal factor into the Column, and the Result into the Row. As sex is generally decided many years before we begin to decide whether we like to get up early or stay up late, sex is the cause (put it into the column) which is thought to determine the result, namely, whether or not you are an Early Bird.

If you do want to swap it around (and see what proportion of Early Birds are female vs what proportion of Night Owls), swap the row and column variables and rerun your crosstabulation. That way, you are still reading column percentages. (It can still be confusing, but hopefully less so).
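Here is one way the swap might look in plain Python: transpose the table, then compute column percentages exactly as before. The helper names `transpose` and `column_percentages` are my own, and the Night Owl counts are again assumed for illustration (only the Early Bird counts of 945 and 951 come from the sample discussed in this post).

```python
# Original layout: rows are chronotype, columns are sex.
counts = {
    "Early Bird": {"Male": 945, "Female": 951},   # from the article
    "Night Owl": {"Male": 743, "Female": 1162},   # ASSUMED for illustration
}

def transpose(table):
    """Swap the row and column variables of a nested-dict crosstab."""
    swapped = {}
    for row, cells in table.items():
        for col, n in cells.items():
            swapped.setdefault(col, {})[row] = n
    return swapped

def column_percentages(table):
    """Express every cell as a percentage of its column total."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

# After the swap: rows are sex, columns are chronotype.
by_chronotype = column_percentages(transpose(counts))

for sex, cells in by_chronotype.items():
    print(sex, {col: f"{pct:.1f}%" for col, pct in cells.items()})
```

Each column of the swapped table now sums to 100%, so you are reading "% of Early Birds who are male" straight down the Early Bird column rather than trying to read across a row.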

And practice. Swap the rows and columns, and see which makes most sense.

Above all, do not mistake the per cent of the column for the per cent of the row.
