
09 August 2017

Data bites: confusing cross-tabulations


Some recent research from a sample of 957 members of PureProfile's Australian panel showed that people who classified themselves as "Early Birds" were two times more likely than "Night Owls" to earn over $70k per annum.

Specifically, 23% of Early Birds earned over $70k p.a. vs just 11% of Night Owls.

Does that mean that 23% of those who earn $70k+ are Early Birds and 11% are Night Owls?

Nope. If that were true, it would leave two-thirds (66%) of $70k+ earners who are neither Early Birds nor Night Owls.

Does the result mean that there are more Early Birds than Night Owls earning above $70k per annum?

Not necessarily.

If the Night Owls are far more numerous than Early Birds in the total sample, then it is quite feasible for there to be more Night Owls who earn $70k+ even while Early Birds are two times more likely to earn $70k+ than Night Owls.
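To see this concretely, here's a quick Python sketch with made-up group sizes (the actual panel breakdown wasn't reported in the study):

```python
# Hypothetical counts (not from the PureProfile study) to show how
# "2x more likely" can coexist with fewer high-earning Early Birds.
early_birds = 200          # total Early Birds in the sample
night_owls = 700           # total Night Owls in the sample

eb_high = round(early_birds * 0.23)   # 23% of Early Birds earn $70k+
no_high = round(night_owls * 0.11)    # 11% of Night Owls earn $70k+

print(eb_high, no_high)   # 46 vs 77: the Night Owl high earners outnumber the Early Birds
```

Early Birds remain twice as likely to be high earners, yet there are far more high-earning Night Owls in absolute terms.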

Making this error is very easy - unfortunately - and even downright confusing in some situations. Here's an example that can seem particularly confounding.

PureProfile's research showed that in the Australian population, men are more likely to be Early Birds than women. About 56% of men are Early Birds compared with just 45% of women (see yellow shading in table below).


However, when we turn the result around so it expresses the proportion of Early Birds (and Night Owls) who are male vs female, we may be surprised to see that 50% of Early Birds are women and 50% are men. (In actual fact, there are slightly more women who are Early Birds than men as we will see in a moment).


Whaaaaaaaaaaaaaat? How can that be?

The problem is one that often confronts us when we do crosstabulations. A crosstabulation (often shortened to crosstab) is simply breaking down the frequency of responses on one variable by groups (in this instance, the groups are male and female).

People tend to get confused because they see the first result (56% of males are Early Birds), and think that this is equivalent to saying that 56% of Early Birds are males.

But this simply ain't so.

Let's break this example out. First, here are the raw counts in each cell. In this sample, there are 945 males who are Early Birds - or 945 Early Birds who are male if you prefer. It is the same thing!

And note that there are slightly more women who are Early Birds than men: 951 women vs 945 men.


The proportion (or per cent) of males who are Early Birds depends on the total number of males there are in the column.



The proportion of Early Birds who are male depends on the total number of Early Birds there are in the row.


So, in a nutshell, there are 945 males who are Early Birds. This represents 56% of the total number of males (column %), but just a fraction under 50% of the total number of Early Birds (row %).
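The same arithmetic in a short Python sketch. The Early Bird counts come from the post; the Night Owl counts are back-calculated to match the reported percentages, so treat them as approximate:

```python
import numpy as np

# Rows: Early Bird / Night Owl; columns: male / female.
# The 945 / 951 Early Bird counts are from the post; the Night Owl
# counts (743 / 1162) are reconstructed so the columns give roughly
# 56% and 45% Early Birds.
counts = np.array([[945, 951],
                   [743, 1162]])

col_pct = counts / counts.sum(axis=0)                 # each column sums to 1
row_pct = counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1

print(col_pct[0])  # share of each sex who are Early Birds: ~0.56, ~0.45
print(row_pct[0])  # share of Early Birds of each sex: ~0.498, ~0.502
```

Same 945 in the top-left cell; two very different percentages, depending on the base.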

The key takeout is this. Whenever a percentage is being reported, take note of the base. Are you looking at the % of the column (in which case the sum of the column is 100%) or the % of the row (in which case the sum of the row is 100%)?

Understanding this distinction is important - and surprisingly often misunderstood. Here's one extreme example to highlight the problem.

Nearly 100% of sexual assaults are perpetrated by males but that does not mean that all males (or even a high percentage of males) are molesters/rapists - thankfully.

However, that doesn't stop many parents, airline policies and even national news anchors from treating all men as potential molesters. Most molesters are male, but most men are not molesters. Again, thankfully.

Drawing this conclusion, and worse, enacting policy based on this result reflects a gross misunderstanding and misinterpretation of the statistics. And it happens to lead to inappropriate stereotyping of a lot of good men. If interested, you can read more about this case here.

How to minimise the danger of this error?

Whenever reporting a percentage, be very clear about what the base is, i.e., x% of what? Quite simply, % of men is not the same as % of Early Birds.

Meghan Trainor - it's all about the base!

If you're preparing crosstabulations (crosstabs), I generally recommend (and myself, generally present) column percentages only. That way, you know you're always comparing the % of column 1 to the % of column 2.

But what goes into the column and what goes into the row? Generally, we try to put the Causal factor into the Column, and the Result into the Row. As sex is generally decided many years before we begin to decide whether we like to get up early or stay up late, sex is the cause (put it into the column) which is thought to determine the result, namely, whether or not you are an Early Bird.

If you do want to swap it around (and see what proportion of Early Birds are female vs what proportion of Night Owls), swap the row and column variables and rerun your crosstabulation. That way, you are still reading column percentages. (It can still be confusing, but hopefully less so).
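As a sketch, here's how that convention looks with pandas, using a handful of made-up respondents:

```python
import pandas as pd

# Toy data (hypothetical respondents): cause (sex) in the column,
# result (chronotype) in the row, as recommended above.
df = pd.DataFrame({
    "sex": ["male"] * 4 + ["female"] * 4,
    "chronotype": ["Early Bird", "Early Bird", "Early Bird", "Night Owl",
                   "Early Bird", "Night Owl", "Night Owl", "Night Owl"],
})

# Column percentages: each column sums to 100%.
col_pct = pd.crosstab(df["chronotype"], df["sex"], normalize="columns")
print(col_pct)

# To read "% of Early Birds who are male", swap the two variables
# and rerun - you are still reading column percentages.
swapped = pd.crosstab(df["sex"], df["chronotype"], normalize="columns")
print(swapped)
```

Either way, the convention holds: compare column 1's percentages with column 2's.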

And practice. Swap the rows and columns, and see which makes most sense.

Above all, do not mistake the per cent of the column to be the same as the per cent of the row.

10 November 2012

Statistical significance is just like a horse race


Green Moon had a chance of less than 1 in 22 of winning the race
The logic of statisticians can seem very complicated and impenetrable to normal folk.

But it really is just a formalised version of our own lay style of how we explain unusual events.

When something unusual happens, there are two possible interpretations.  One is to view the unusual event as a freak occurrence, a chance-result, a coincidence.  The other is to view the event as a sign that our understanding of what is going on is fundamentally wrong.

So, is the unusual event simply surprising or does it stretch credulity?  Did we see a rare occurrence or is there some other explanation?

It's a bit like interpreting the result of a horse race won by a horse with long odds. Is the win a possibility even if improbable, or is it so improbable as to be considered an 'impossibility' requiring a brand new explanation?
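The analogy maps neatly onto the usual significance test. A minimal sketch, treating 22-to-1 odds as an implied win probability of about 1/23 and 0.05 as the conventional threshold:

```python
# An event has probability p under our current "explanation" (the null
# hypothesis). If p falls below a conventional threshold, we stop
# calling the event a fluke and start doubting the explanation.
odds_against = 22
p = 1 / (odds_against + 1)   # 22-to-1 odds imply p ~ 0.043

alpha = 0.05                 # conventional significance threshold
surprising_enough = p < alpha
print(p, surprising_enough)
```

By that logic, Green Moon's win would just scrape over the line into "rethink your model" territory.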

Read more on this idea in this article posted on The Drum / ABC : The Melbourne Cup and Statistical Significance

11 November 2011

Building better brand metrics : multi-collinearity as friend rather than foe


Multi-collinearity looks more complicated than it is!
Metrics are hot.  Multi-collinearity is not.  

Multi-collinearity.  It is a big word – and a big mystery to many students of statistics and even practitioners – just like the word, heteroscedasticity!   

The existence of multi-collinearity actually makes the world a simpler place in a practical sense.  

Are your clients currently enthused by Balanced Score Cards, Brand Metrics, Net Promoter Score and various other tools that consist of many apparently independent measures for assessing the health of a company and/or brand?  Well, stay tuned because it does not have to be that hard, as multi-collinearity will show!

Multi-collinearity is simply the problem of two predictor variables being correlated with one another such that the contribution of each to the criterion is difficult to tease apart.  Imagine trying to predict purchase intentions, and we measure both ‘price’ and ‘value.’  Clearly both are useful for predicting purchase intentions, but the two are also very likely to be correlated to one another.  This means that once we know one, the other does not add much to our prediction.

Okay, we understand the problem, but do we understand how often we encounter this situation?  And how often we may be misrepresenting the results to our clients as a consequence of many correlations between the predictor variables we report to our clients?  If you are using any kind of multi-attribute rating models (e.g., Vroom’s expectancy-valence model, Fishbein & Ajzen’s original attitude-model, Gale’s Customer Value Analysis model, etc.), then you are likely encountering this problem.  These are the models where you measure how customers rate various attributes of the brand, and use these ratings to determine what are the ‘drivers’ of brand purchase.

Typically, ratings of any brand on these attributes are highly correlated.  For instance, if you chose to assess ratings of ‘price’ and ‘value’ as two separate attributes, they will typically be highly (negatively) correlated.  In a multiple regression, the result is that one will contribute significantly to the regression, and the other one, because it is highly correlated to the first, will not.
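A small simulated example (hypothetical data, not the survey models named above) makes the effect visible: both predictors correlate with the outcome on their own, but once one enters the regression the other adds almost nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated ratings: 'value' is strongly (negatively) tied to 'price'.
price = rng.normal(size=n)
value = -0.9 * price + 0.3 * rng.normal(size=n)

# Purchase intent is driven by the shared underlying dimension.
intent = value + 0.5 * rng.normal(size=n)

# Each predictor alone correlates strongly with intent...
print(np.corrcoef(price, intent)[0, 1])   # strongly negative
print(np.corrcoef(value, intent)[0, 1])   # strongly positive

# ...but jointly they fight over the same variance: once 'value' is
# in the model, 'price' contributes almost nothing.
X = np.column_stack([np.ones(n), price, value])
coefs, *_ = np.linalg.lstsq(X, intent, rcond=None)
print(coefs[1], coefs[2])   # price weight near zero, value weight near one
```

Neither predictor is "unimportant" - they are simply restatements of one underlying dimension.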

‘Aha,’ you say, ‘but I take explicit measures of the importance ratings.’  Yes, well unfortunately this does not solve the problem.  As most of us know, respondents will typically tell us that all attributes are pretty important.  You can play games with constant-sum scales that help differentiate importance, but you are still not dealing with the problem of what I might call non-statistical multi-collinearity.  The problem is that if you ask a respondent how important is ‘value’ and how important is ‘price’, they will probably give a fairly equal importance rating to both.  Why wouldn’t they – they are really much the same!

What is the solution?  One suggestion is to retain just one of the multiple correlated items.  This is certainly one solution – and links to a tangential issue about better-quality drafting of questions.  If we can anticipate ahead of time that two attributes are going to be highly correlated, we can consider measuring just one or the other.

However, I am also a great believer in combining separate items collected on a questionnaire as they provide a more stable (reliable) measure than using a single item.  That is, I measure multiple attributes, even if they are likely to be correlated.  Then, I conduct an examination of the intercorrelations of the various attributes to see if I can simply combine two or more items into one scale.  If I want to be really sophisticated, I could conduct a factor analysis for guiding the combination of items. This allows for a sophisticated weighting of each variable in the final ‘scale.’  However, I generally find that clients (and analysts) find simple, averaged scales much easier to interpret than factor scores. 
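Here's a sketch of that workflow on simulated ratings: check the intercorrelation, then average the items, and note that the combined scale tracks the underlying quality more reliably than either single item:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical ratings: two items tapping the same underlying quality.
quality = rng.normal(size=n)          # the (unobservable) true dimension
price_rating = quality + 0.6 * rng.normal(size=n)
value_rating = quality + 0.6 * rng.normal(size=n)

# Check the intercorrelation first...
r = np.corrcoef(price_rating, value_rating)[0, 1]
print(round(r, 2))                    # high, so combining is defensible

# ...then combine the items into one simple averaged scale.
scale = (price_rating + value_rating) / 2

# The averaged scale is a more reliable measure than either item alone:
# its correlation with the true dimension is higher.
print(np.corrcoef(price_rating, quality)[0, 1])
print(np.corrcoef(scale, quality)[0, 1])
```

Averaging cancels some of each item's measurement noise, which is exactly the stability argument for multi-item scales.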

However, one rather disturbing result that I have found in examining these intercorrelations among attribute ratings is that many of the ratings are correlated with many of the others!  Even among sophisticated respondents such as doctors, I find that the ratings they give to a drug in terms of potency, efficacy, side-effect profile, drug-interactions, cost and value are all likely to be correlated.  More broadly, I find that on many research projects, many of the intangible qualities of the brand that we might measure (brand awareness, attribute ratings, overall evaluations, satisfaction, usage, etc.) are all highly correlated.

Many researchers appear to be unaware of or unwilling to acknowledge such correlations, and will happily make recommendations to tweak a particular quality of the brand in order to improve overall image, satisfaction, purchase intentions, etc.   However, if all these attributes are so highly correlated, advice to tweak one or the other is at best rather meaningless and at worst, rather misleading. 

However, to offset the bad news, there is some good news.  The good news is that the intercorrelations between the many predictor variables means that we do not need to consider a screed of so-called independent measures to assess the health of a brand.  Some clients have had me explore various brand metrics, and what I find is that often-times, we need only look at relatively few numbers rather than many to assess the health of our brand!  Why?  Because of multi-collinearity.  The independent measures are often so highly correlated that they can all be combined into one or at least relatively few scales which capture most of the important intangibles. 

For instance, in research we have conducted in both pharmaceutical and agricultural domains, we have found that we can reduce many of the measures of the intangibles (such as customer attribute ratings of the brand among other things) down to perhaps three dimensions which operate as very strong predictors of brand purchase. 

Of course, you might like to know what those dimensions are, right?  Unfortunately, that would be telling!  Nevertheless, I have given you the key to simplifying the intangible assets of the brand.  You can work it out. 

And as to heteroscedasticity, I will leave that to another day.

Multicollinearity - the magic behind brand metrics


Reducing two problems to one solution 

Metrics are hot.  Multi-collinearity is not! 

There are multiple metrics for measuring the health of the brand and/or company, but simple interpretations seem to be sparse.

Meanwhile, multi-collinearity is generally identified as a statistical problem in which two or more predictor variables are correlated with one another, which muddies the interpretation of which variables predict important outcomes such as purchase intentions.

However, if the two are put together, the world of marketing can become very much simpler and clearer.    

The existence of multi-collinearity actually makes the world a simpler place in a practical sense, and by understanding the simplifying properties of multi-collinearity, we can simplify the metrics that we use to measure things such as brand-health.

Problem of Multiple Metrics

Market researchers have built their empire on multiple metrics; call them measures or items or questions if you prefer.  And modern marketing is encouraging this through the pursuit of ‘metrics’ such as the Balanced Score Card, Brand Metrics, Net Promoter Score, etc. 

And in turn, each of these metrics is made up of many apparently independent measures.  For instance, most brand metrics will include one or more measures of dimensions such as awareness, familiarity, liking, attribute-rating, attribute-importance, attitude, satisfaction, purchase intentions, etc. 

The notion is that what is measured is managed, and so if we are measuring something, then we are on the way to managing it.

The output from studies is multiple metrics, multiple charts, multiple tables, multiple pages, and multiple options but few clear solutions.

Problem of Multi-collinearity

For most researchers, multi-collinearity is a disaster.  More disturbingly, for some researchers, it is a bit of a mystery. 

Multi-collinearity is simply the way in which two or more predictor variables in a multiple regression are related to one another.  For instance, many companies are keen to know what ‘drives’ the customers' decision making.  In predicting what drives doctors' prescribing behaviors, three of the attributes that are often included for consideration are ‘efficacy,’ ‘side-effect profile’ and ‘drug interactions.’  It is not uncommon to find that all three attributes are positively correlated to prescribing intentions. 

A simplistic interpretation is that each attribute is an independent driver of prescribing.  However, if a regression is run, it may be found that only one attribute predicts intentions, the weights for the other two being not significantly different from zero.  This is a clue that multi-collinearity may be in operation.

Multi-collinearity can be seen if intercorrelations of the independent variables are examined.  If two or more are significantly intercorrelated, we probably have multi-collinearity issues.
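A minimal sketch of that check, using simulated attribute ratings that share one underlying dimension (hypothetical data, not doctors' actual ratings):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# One underlying dimension drives all three simulated attributes.
core = rng.normal(size=n)
efficacy     = core + 0.5 * rng.normal(size=n)
side_effects = core + 0.5 * rng.normal(size=n)
interactions = core + 0.5 * rng.normal(size=n)

X = np.column_stack([efficacy, side_effects, interactions])
corr = np.corrcoef(X, rowvar=False)
print(corr.round(2))   # large off-diagonal values flag multi-collinearity
```

Large off-diagonal correlations like these are the warning sign to inspect before interpreting any regression weights.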

More simply, if three attributes such as efficacy, side-effect profile and drug interactions are correlated with one another, and only one of them is significant in a multiple regression model predicting prescribing, then we can conclude that there is really probably only one dimension underlying all three attributes which is driving prescribing intentions. 

Implications

Multiple metrics sound great, but too many make simple interpretation difficult.  Most market research studies are built based on multiple metrics (or more simply, measures or items or questions).  How often do we misrepresent the results by reporting the many correlations between the predictor variables and a dependent variable such as prescribing – without allowing for the possibility that the various predictor variables are correlated or simply multiple restatements of one underlying relationship?

What is the solution?  Well, first, we can examine the relationships between the key metrics. 

Then we have two options for handling metrics that are correlated with one another.  One solution is to retain just one of the multiple correlated items.  At a more practical level, if we can anticipate ahead of time that two (or more) metrics are going to be highly correlated, we can consider measuring just one or the other.

Another solution is one that combines the separate metrics into a single index as this provides a more stable or reliable estimate.  That is, we use multiple measures, even if they are likely to be correlated, and then combine items that are correlated with one another into one scale. 

If we want to be really sophisticated, we could conduct a factor analysis for guiding the combination of items. This allows for a sophisticated weighting of each variable in the final ‘scale.’  However, in my experience, clients (and analysts) find simple, averaged scales much easier to interpret than factor scores. 
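A sketch of both options on simulated metrics, using the first principal component as a stand-in for a one-factor solution: with items this highly correlated, the weighted and simple averaged scales end up nearly interchangeable:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Four hypothetical metrics driven by one underlying brand dimension.
core = rng.normal(size=n)
items = np.column_stack([core + 0.5 * rng.normal(size=n) for _ in range(4)])

# Factor-analysis-style weights via the first principal component...
centered = items - items.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
weighted_scale = centered @ vt[0]

# ...versus the simple averaged scale clients find easier to read.
simple_scale = items.mean(axis=1)

# With items this similar, the two scales are nearly interchangeable.
print(abs(np.corrcoef(weighted_scale, simple_scale)[0, 1]))
```

Which is why, in practice, the simple average usually wins: it sacrifices almost nothing statistically and is far easier to explain.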

Multi-collinear metrics

For some, multi-collinearity is understood to be a problem, and something we would prefer to avoid.  However, multi-collinearity is not a problem, it is reality.  Multi-collinearity tells us that in many senses the world is a simpler place than we originally thought – and when business is as complicated as it is today, this is a very good thing.

My own exploration in this area came from the initially disturbing finding in examining intercorrelations among many of the ratings of brands on various attributes.  Even among sophisticated respondents such as doctors and specialists, the ratings they give to a specific drug in terms of potency, efficacy, side-effect profile, drug-interactions, cost and value are all likely to be correlated.  More broadly, on many research projects, I have found that many of the intangible qualities of the brand that we might measure (brand awareness, attribute ratings, overall evaluations, satisfaction, usage, etc.) are all highly correlated.

Many researchers appear to be unaware of or unwilling to acknowledge such correlations, and will happily make recommendations to tweak a particular quality of the brand in order to improve overall image, satisfaction, purchase intentions, etc.   However, if all these measures are so highly correlated, advice to tweak one or the other is at best rather meaningless and at worst, rather misleading. 

However, to offset the bad news, there is some good news.  The good news is that the intercorrelations between the many predictor variables means that we do not need to consider a screed of so-called independent measures to assess the health of a brand.  Some clients have had me explore various brand metrics, and what I find is that often-times, we need only look at relatively few numbers rather than many to assess the health of our brand!  Why?  Because of multi-collinearity.  The independent measures are often so highly correlated that they can all be combined into relatively few scales which capture most of the important intangibles. 

For instance, in research I have conducted in both pharmaceutical and agricultural domains, I have found that we can reduce many of the measures of the intangibles down to three dimensions that operate as very strong predictors of brand purchase. 

Of course, you might like to know what those dimensions are, right?  Unfortunately, that would be telling!  Nevertheless, I have given you the key to simplifying the intangible assets of the brand.  You can work it out.