- Calculating probabilities for two-category responses.
- Binary choice testing.
- A-B testing.
- Normal approximations for categorical data, and their limits.
- Binomial, Fisher’s Exact, Chi-square, and G tests.
- Multiple design (“multivariate”) testing.
- Pitfalls in on-line apps.
Over at DIY themes, a simple A-B test led to a startling conclusion: Removing “social proof” text at sign-up doubled conversions from 1% to a blistering 2%. Over 80 comments were exchanged debating how such a “sure-fire persuasion tactic” hurt web site performance. Few commenters are suspicious of the small “sample size” (the given numbers suggest 8 conversions out of 793 visitors for one site design, and 15 conversions out of 736 visitors for the other design). However, only Stephen comes close to asking the right question: what is the p-value? Using a p-value calculator supplied by the service that did the A-B test, he gets 0.051, which he declares not significant (i.e., p is more than 0.050).
Is that enough to ignore the results? If you read the first post in this series you know that “statistical significance” is not some magic border between reality and falsehood. For the purposes of making business decisions, the difference between 0.051 and 0.050 is barely anything. It’s equal to, well, the difference between 0.051 and 0.050. But that doesn’t mean I think we’ve evidence that social proof had backfired in this case.
“How many?” It’s the simplest quantitative data you can imagine: the mere count or frequency of something. How many users fail to complete a task? How many conversions? How many prefer Design A over Design B? Simple as it is to measure counts (or related percents), it is among the most complicated forms of data to analyze, with several viable options each with their own strengths and limitations. It’s easier than you might expect to get it wrong.
But how complicated can counts be? After all, we’ve already dealt with counts in Stat 101. There, we calculated the p-values of two or more users having a problem with a site or app given a couple hypothetical states. For example, we determined that, assuming a population failure rate of 10%, there is a 0.028 probability of 2 or more users failing out of a sample size of three. Based on that, we concluded that the actual population rate is over 10%. We did it all without doing any math.
As Tonto said to the Lone Ranger when they were surrounded by hostile Sioux, “What do you mean ‘we,’ pale face?” Of course, I did the math for you behind the scenes. What I actually did was, using the 10% hypothetical chance of a single failure, calculate every possible path to getting 2 out of 3 failures. I calculated the probability of the first two users failing but not the third, the probability of the last two users failing but not the first, and the probability of the first and third user failing but not the second. I summed all those probabilities to get the probability of 2 out of 3 failing. Then I did the same for 3 out of 3 failing (there’s only one path for getting that) and added that in to get the total probability of 2 or more failing for a sample size of three. To make the graphs and table for a sample size of three, I did the same for 3, 1 and no users failing out of three.
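The path-by-path bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the original tables: the function names are made up for this example, and it assumes each user fails independently with the same 10% chance, so the probability of a path is the product of its individual outcomes.

```python
from itertools import product

def path_probability(path, p_fail):
    """Probability of one specific sequence of outcomes (True means the user failed)."""
    prob = 1.0
    for failed in path:
        prob *= p_fail if failed else (1.0 - p_fail)
    return prob

def prob_exact_failures(f, n, p_fail):
    """Add up the probabilities of every path with precisely f failures among n users."""
    return sum(
        path_probability(path, p_fail)
        for path in product([True, False], repeat=n)
        if sum(path) == f
    )

p_two = prob_exact_failures(2, 3, 0.10)    # three paths, 0.009 each
p_three = prob_exact_failures(3, 3, 0.10)  # one path
print(round(p_two, 3), round(p_three, 3), round(p_two + p_three, 3))
```

Brute-force enumeration like this doubles in work with every added user, so it is only practical for tiny samples.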
The probability of getting a certain count of events (e.g., failures) given the chance of a single event (failure) is a binomial probability. The “bi” in “binomial” in this case means “two possible outcomes” for each event -succeed or fail for example. That’s in contrast to other kinds of events, like a user rating something on a 35-point scale, which has 35 possible outcomes, or the time to complete a task, which has roughly a bjillion possible outcomes, depending on the precision of your stopwatch. For those situations, we already saw that you get p-values on the averages using a t-distribution as our sampling distribution.
For the number of users failing (versus succeeding), we use the binomial distribution as our sampling distribution, which comprises the probability of every possible result (counts of users) of every possible sample for a certain sample size and hypothetical state. Each column of each table in my tables of binomial probabilities from Stat 101 is a sampling distribution.
The process for calculating binomial probabilities to get the sampling distribution actually isn’t so complicated. We’re helped tremendously by assuming that the users perform independently -that the chance of the second user failing is uncorrelated with the chance of the first user failing, for example. Specifically we assumed that each user has a 10% chance of failing (our Hypothetical State A) regardless of how every other user actually performs in each possible path. Such independence of events is reasonable to expect as long as you’re testing each user separately.
Independence simplifies life for the statistician. When you assume independence, you can get the probability of any path by multiplying together the probability of each event in that path. For example the probability of the first two users failing but not the third is 0.10 * 0.10 * 0.90 = 0.009 (the last number is the probability of not failing). It also means the probability of the first two users failing is the same as the probability of the last two users failing is the same as the probability of the first and third user failing, thanks to the good ol’ commutative property that you learned in grade school. Once you know the probability of one path, you know the probability of all paths. So the way to actually calculate binomial probabilities is to calculate the probability of one of the possible paths then multiply that by the number of possible paths. In this example, there are three different combinations of success and failure that give me two out of three, so the chance of getting precisely 2 out of 3 failures is 0.009 * 3 = 0.027.
Mathematically, it’s p = P^f * (1 – P)^(n – f) * n! / (f! * (n – f)!), which breaks down as:
- n = sample size
- f = your number of observed events (failures)
- P = your probability (not percent) in your hypothetical state
Where f! isn’t a loud and obnoxious f-er, but the “factorial” of f, which is:
f! = f * (f – 1) * (f – 2) * (f – 3) … all the way down until you get to 1.
Likewise for n!.
That’s the process for getting the probability of precisely f out of n events. To get the probability of f or more events, you also calculate the probability of f + 1 events, f + 2 events, f + 3 events up to and including n, then add all those probabilities together. In the case of two or more out of three, the only path remaining is the chance of getting three failures in a row, which is 0.001. Add that to the 0.027 we got for precisely 2-out-of-3, and you get the final answer of 0.028.
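The formula and the f-or-more summation above translate almost directly into code. Here is a minimal Python sketch; the function names `binom_pmf` and `binom_or_more` are invented for this example, and `math.comb` (Python 3.8 or later) stands in for the n! / (f! * (n – f)!) term.

```python
from math import comb

def binom_pmf(f, n, P):
    """Probability of precisely f events out of n, given per-event probability P."""
    paths = comb(n, f)                  # n! / (f! * (n - f)!), the number of paths
    one_path = P**f * (1 - P)**(n - f)  # probability of any single path
    return one_path * paths

def binom_or_more(f, n, P):
    """Probability of f or more events: add up precisely f, f + 1, ... up to n."""
    return sum(binom_pmf(k, n, P) for k in range(f, n + 1))

print(round(binom_pmf(2, 3, 0.10), 3))      # precisely 2 out of 3
print(round(binom_or_more(2, 3, 0.10), 3))  # 2 or more out of 3
```

With n = 3 and P = 0.10 this reproduces the 0.027 and 0.028 worked out by hand above.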
Here’re all the binomial probabilities for a sample size of three with a hypothetical state of 10%. Just as you can sum the probabilities for each precise f to get the probability of f-or-more, you can also sum them the other direction to get the probability of f-or-less.
|f|p(precisely f)|p(f or less)|p(f or more)|
|0|0.729|0.729|1.000|
|1|0.243|0.972|0.271|
|2|0.027|0.999|0.028|
|3|0.001|1.000|0.001|
No, How you Really Calculate Binomial Probabilities
By now it’s clear that binomials are not so much complicated as boring. The probability of getting two or more out of three isn’t so bad, but if you have a sample size of 30 and you want to know the probability of getting 20 or more failures, that means getting the probability of precisely 20, 21, 22, 23, and so on to 30 failures, then adding them up. This is 2012. We don’t have that kind of attention span. So write a program or script to loop through the calculation of f-or-more failures (one of my earliest desktop programs -in Pascal -calculated binomial probabilities), or set up a spreadsheet that replicates the calculations through x-to-n number of cells.
Or why bother to re-invent the wheel? Excel, for example, has a BINOMDIST() function. Enter the observed count of events (f), the sample size (n), the probability from the hypothetical state, and TRUE for “cumulative,” and Excel gives you the probability of f events or less. That is, it gives you the left tail of the distribution. To get the probability of f events or more, calculate f – 1 events or less and subtract that from one:
p(f events or more) = 1 – BINOMDIST(f – 1, n, P, TRUE)
For example, to get the probability of 2 or more failures out of three, calculate 1 minus the probability of 1 or fewer failures.
This is reminiscent of what we did with the t-distribution when we wanted the p-value through the center of the distribution. A key difference is that the binomial distribution is discrete -the possible observed values are integers -while the t-distribution (like the normal distribution) is continuous, allowing any fraction. In a t-distribution the probability of x or more is essentially the same as one minus the probability of x or less. But in a discrete distribution, the probability of f or more is one minus the probability of (f – 1) or less.
If it’s too uncool to use an old-fashioned desktop application, you can search for a web or mobile app to calculate binomial probabilities. More on that in a minute.
Now you know where I get my Stat 101 numbers from. More practically, now you can calculate your own probabilities for values of P that I didn’t include in my tables (e.g., 16.7%, 3%, 0.01%).
Binary Choice Data
But suppose you fall through a moth hole in the fabric of space-time to 1978, a time when mobile apps, iTunes, the web, PCs, and computers were not routinely available or existing. You immediately notice that Mick Jagger is already old. Almost as horrific, you next notice that people’s attention span really isn’t significantly different than today’s. Statisticians didn’t have the patience to calculate binomial probabilities by hand back then either. Instead, they worked out short-cuts.
Suppose you wanted to improve the user experience for your users in 1978, by… umm… providing blank punch cards in an assortment of colors. As your users sit in windowless institutional cinder block keypunch rooms clacking away on battleship gray keypunch machines, maybe a little more color will trickle some sunshine into their dreary little lives (the users, not the keypunch machines).
You float the idea to Sven, the Supreme Lord of the Mainframes and the original Bastard Operator from Hell (BOFH). Sven is delighted by the idea. In fact, Sven is willing to consider investing in the more expensive colored punch cards if you can show there’s any preference for them among the users. You’re surprised that a BOFH would lift a finger to help users, but this is the original BOFH -he’s literally a bastard operator from Hell. With a tear in his eye, he’ll proudly tell you how his unmarried mother worked days and nights in his tiny hometown of Hell, Norway so he could get an education and become the Supreme Lord of the Mainframes.
To test users for card preference, you count how many users choose colored versus plain punch cards. To keep your users independent, you have to observe at a time and place when only one user at a time goes to get punch cards so that they don’t influence each other. Also, if you see the same user twice, you only count his or her choice for the first time in order to maintain independent observations (yes, his or her in 1978; just ask my sister-in-law). You also have to switch the shelf positions of the punch cards between each user to balance out any possible position effects (maybe users naturally tend to grab from the stack of cards on the left or right). After doing this for 40 users, it’s getting monotonous, but the data are very compelling: 29 out of 40 users took colored punch cards -almost 3 out of 4.
So let’s set Hypothetical State A to represent No Preference, which would mean the population chance of choosing colored over plain is 50%. Using your binomial app or BINOMDIST() function, you see that the probability of getting 29 or more users out of 40 for a 50% population rate is 1 – BINOMDIST(28,40,0.5,TRUE) = 0.0032.
Looks like you have a solid statistical case for users preferring colored punch cards.
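If you would rather script it than open a spreadsheet, the same left-tail-then-subtract trick looks like this in Python. The `binomdist` function below is a made-up stand-in that only mimics the argument order of Excel’s BINOMDIST(); it is not Excel’s implementation.

```python
from math import comb

def binomdist(f, n, P, cumulative):
    """Stand-in for BINOMDIST(): precisely f events, or f-or-less when cumulative is True."""
    def pmf(k):
        return comb(n, k) * P**k * (1 - P)**(n - k)
    if cumulative:
        return sum(pmf(k) for k in range(0, f + 1))
    return pmf(f)

# 29 or more out of 40 is one minus the probability of 28 or less (note the f - 1).
p_29_or_more = 1 - binomdist(28, 40, 0.5, True)
print(round(p_29_or_more, 4))
```

Passing 29 instead of 28 would quietly drop the probability of precisely 29 from the tail, which is the off-by-one that discrete distributions invite.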
The Normal Approximation of the Binomial
Oh, right, you can’t get binomial probabilities so easily because it’s 1978 and your smart phone’s battery is dead from too many games of Angry Birds while you were waiting for users to come get their punch cards. This is when you appreciate the shortcuts pioneered by yesterday’s statisticians.
Take a look at the binomial distribution for 40 with a 50% hypothetical population rate. This graph gives the probability of observing each f precisely (i.e., BINOMDIST(f,40,0.5,FALSE)), so to get f-or-more, you’d add all the bars from f to the right tail of the distribution.
Looks like a normal distribution, doesn’t it? In fact, thanks to the Central Limit Theorem, binomial distributions tend to be normal distributions, much like sampling distributions of averages tend to be normal distributions. Any binomial distribution will tend to be normally distributed with:
population average = P*n
standard error = SQRT( n * P * (1- P) )
We can use the normal distribution much like the t-distribution by converting our observed statistic (29, the count of users who chose colored punch cards) into units of standard error of the normal distribution. The new converted statistic is called “z”. I tell statistics groupies it’s named after me. Really impresses them.
z = (f – hypo) / se
f = the observed count, 29 in this case.
hypo = the hypothetical state for counts (i.e., P * n, or 20 in this case).
It works just like the t-statistic except that we have the actual standard error, not an estimate based on the standard deviations in our samples. That means there are no degrees of freedom to worry about, and you use the standard normal distribution, rather than the t-distribution adjustment for using an estimated standard error. In Excel, that’s the NORMSDIST() function, which gives the one-tailed probability of z or less, so you have to subtract it from 1 to get z or more.
Accuracy of the Normal Approximation
Applying the normal approximation to our punch card data, we get:
z = (29 – 0.5 * 40) / SQRT(40 * 0.5 * 0.5) = 2.85
Using Excel’s NORMSDIST() to get the p-value, remembering that NORMSDIST() gives the probability of z or less:
p-value = 1 – NORMSDIST(2.85) = 1 – 0.9978 = 0.0022
Compared to the true value we got from the binomial distribution, that’s not bad for an estimate. I mean, 0.0032, 0.0022, whatever. The point is you can’t plausibly say there is no preference for colored punch cards.
But the normal approximation is only an estimation based on the assumption that your sampling distribution (the binomial distribution for your sample size and hypothetical state) is close enough to a normal distribution. It’s the same assumption underlying the t-test, so its accuracy can be foiled by the same combination of things:
- The smaller the sample size the worse the accuracy.
- The more skewed the distribution the worse the accuracy.
For sample size effects, consider the probability of getting 6 or more events out of a sample of 8 given a 50% population rate. The true binomial probability is 0.1445 while the normal approximation estimates it’s 0.0786, about half as much. Contrast that with the 0.0032 versus 0.0022 estimates above, where the estimate is off by about a third.
Skewness is related to P, the probability in your hypothetical state. The more P deviates from the midpoint of 0.5, the more lopsided the binomial distribution becomes. That makes sense: the lower the probability of each event, the more frequently you’ll get samples with low counts of events–the more they’ll tend to pile up at the low end of the scale. Here, for example, is the binomial distribution for a sample size of 40 when the population rate is 5%:
The opposite holds when the population rate of each event is over 50%.
For example, instead of measuring colored punch card preference, suppose you ran your 40 users through a usability test and counted how many failed to complete the task. You might need to run such a large sample size if the costs of both Type I and Type II errors are so large there’s little gap between Hypothetical States A and B. If Hypothetical State A is 0.05, then the true probability of observing 4 failures or more out of 40 users is 0.1381, but the normal approximation estimates it at 0.0734 -again half as much. Not too good to underestimate your error rate so much when costs are high.
Put both small sample size and skewed distributions together, and you can be way off. Remember that the true probability of 2 or more out of 3 given a 10% population rate is 0.0280? The normal approximation estimates it’s 0.000535!
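To see the drift for yourself, here is a small Python sketch that puts the exact binomial tail next to the uncorrected normal approximation for the four cases discussed above. The function names are invented for this example, and the normal tail is computed from `math.erfc` rather than a statistics library.

```python
from math import comb, erfc, sqrt

def exact_or_more(f, n, P):
    """True binomial probability of f or more events out of n."""
    return sum(comb(n, k) * P**k * (1 - P)**(n - k) for k in range(f, n + 1))

def normal_or_more(f, n, P):
    """Normal approximation: convert f to z, then take the upper tail."""
    mean = n * P
    se = sqrt(n * P * (1 - P))
    z = (f - mean) / se
    return 0.5 * erfc(z / sqrt(2))  # probability of z or more under the standard normal

cases = [(29, 40, 0.50), (6, 8, 0.50), (4, 40, 0.05), (2, 3, 0.10)]
for f, n, P in cases:
    print(f"{f} or more of {n} at P={P}: "
          f"exact {exact_or_more(f, n, P):.4f}, normal {normal_or_more(f, n, P):.6f}")
```

The gap between the two columns widens as the sample shrinks and as P moves away from 0.5, which is the pattern described above.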
What to Do About it
There are some attempts to reduce the inaccuracies of the normal approximation. Some of the inaccuracy of the normal estimate can be traced to the counts (f) in the binomial distribution being discrete integers, while the normal distribution is a smooth continuous function. Yates’ continuity correction seems to me a reasonable and effective way to compensate for this, but apparently it doesn’t work as well as it should when you run it on realistic data, so it’s controversial today.
The real solution is simply don’t use the normal approximation. This isn’t 1978 and we’re not punching cards any more. Just about anyone with any access to today’s technology can calculate the true binomial probabilities.
The only reason I can see to use the normal approximation today is when you’re doing a quick check literally in your head. For example, Jason Cohen has described a simple rule that allows you to decide if an observed count has a p-value less than 0.05 given a population rate of 50%. I’ve personally used the rule that the 95% confidence interval of the observed rate of a binary event is about the same as the inverse of the square root of the sample size (1/SQRT(n)), so you need a sample size of 100 to get results you’re very sure are within 10% of the true value. I use it to estimate the sample size of poll results on TV when they report the “margin of error.” Another way to wow the stat groupies. Both of these mental tricks use the normal approximation of the binomial distribution (technically, Cohen uses the chi-square distribution with one degree of freedom, but that’s a mathematical function of the normal distribution so it gives the same answer). Go ahead and use them, but realize you can be off by a factor of 2 pretty easily, and that either trick only works well with large sample sizes and population rates not too far from 50%.
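For the curious, here is where the 1/SQRT(n) rule of thumb comes from, as a short Python sketch. It assumes the usual normal-approximation half-width of 1.96 standard errors for a 95% interval and the worst case of a 50% rate; the function names are made up for this example.

```python
from math import sqrt

def margin_95(n, P=0.5):
    """Normal-approximation 95% margin of error for an observed rate."""
    return 1.96 * sqrt(P * (1 - P) / n)

def rule_of_thumb(n):
    """The in-your-head version: one over the square root of the sample size."""
    return 1 / sqrt(n)

def sample_size_from_margin(margin):
    """Invert the rule of thumb to guess a poll's sample size from its margin of error."""
    return round(1 / margin**2)

for n in (25, 100, 400, 1000):
    print(n, round(margin_95(n), 3), round(rule_of_thumb(n), 3))
print(sample_size_from_margin(0.03))
```

At a 50% rate the formula works out to 0.98/SQRT(n), which is why rounding it to 1/SQRT(n) costs almost nothing.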
The real reason I spent so much time on the normal approximation of the binomial distribution is that some apps out there use the normal approximation instead of calculating the true binomial probabilities. They seem to be many of the ones that come up when you google for A-B testing. Before you use a binomial calculator on the web or download a binomial app, check if it’s giving real probabilities or the approximation. If the documentation isn’t clear, test the app with a small sample size and population rate far from 50% and compare the results to what you get with BINOMDIST() or my binomial tables. For example, the true chance of getting three or more out of 10 when the population rate is 10% is 0.0702. The normal approximation will say it’s 0.0175 or maybe 0.0569 if it’s attempting to correct for continuity.
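Here is one way to generate those three check values yourself, sketched in Python. The `continuity` flag is a made-up parameter for this example; it applies the textbook half-count shift, which may or may not be what a given app does.

```python
from math import comb, erfc, sqrt

def exact_or_more(f, n, P):
    """True binomial probability of f or more events out of n."""
    return sum(comb(n, k) * P**k * (1 - P)**(n - k) for k in range(f, n + 1))

def normal_or_more(f, n, P, continuity=False):
    """Normal approximation of f or more, optionally shifted by half a count."""
    shift = 0.5 if continuity else 0.0
    z = (f - shift - n * P) / sqrt(n * P * (1 - P))
    return 0.5 * erfc(z / sqrt(2))

f, n, P = 3, 10, 0.10
print("true binomial:         ", round(exact_or_more(f, n, P), 4))
print("normal approximation:  ", round(normal_or_more(f, n, P), 4))
print("normal with correction:", round(normal_or_more(f, n, P, continuity=True), 4))
```

Feed the same three numbers into the app you are evaluating and see which line it matches.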
Another pitfall I’ve seen in some on-line apps is to fail to distinguish between one-tail and two-tailed testing. Like a t-test, a binomial test can be used to calculate the probability for a one-tailed or two-tailed test. In a one-tailed test, you calculate the probability of getting either an observed count or more, or an observed count or less, but not both at once. So far, all tests in this post and in Stat 101 were one-tailed tests. For instance, in our colored punch card test, we only considered the possibility of users preferring colored cards over plain cards. We weren’t interested in the possibility that users prefer plain cards over colored cards. As far as we were concerned, that wouldn’t be any different than no preference for either card.
In a two-tailed test, you calculate the probability of getting an observed count as extreme or more extreme: you determine the chance of deviating both ways away from your hypothetical state at the same time. This is the procedure you use if you’re interested in the possibility of plain cards being preferred over colored -where that would impact your UI design (or punch card policy, in this case). For example, suppose plain and colored cards cost about the same, or suppose that, while usually plain cards are cheaper than colored cards, sometimes colored cards are cheaper than plain cards. Now it makes a difference whether users prefer plain cards versus users not caring. If users don’t care, you’d buy whatever cards are cheaper at the moment, but if users prefer plain cards, you’d buy plain cards, favoring them even if they’re more expensive. Or maybe there’s an important lesson to learn if users prefer plain cards over colored -something about users being distracted or annoyed by gratuitous attempts to improve aesthetics (although there are practical reasons to use colored punch cards too). All these are reasons to do a two-tailed test -to calculate the probability of getting a count as extreme or more extreme than what you’ve observed in your sample.
If your Hypothetical State A is exactly 50%, then the binomial probabilities are symmetrical, and you get the two-tailed probability for both Hypothetical State A and B the same way you do with a t-test -simply double your one-tailed probability. In our punch card example, the two-tailed probability of getting 29 out of 40 is 0.0032 * 2 = 0.0064, which is actually the probability of getting 29 or more or 11 or less.
When Hypothetical State A isn’t exactly 50%, then the binomial probabilities are no longer symmetrical, and the probability of one tail isn’t quite equal to the probability of the other. The most common procedure for getting the two-tailed value is the Method of Small p-values, which, in a nutshell, means finding the precise count of events in the other tail that has a probability equal to or less than the probability of precisely the observed count, then summing the probabilities from the count you found to the end of the tail. That can mean some hunting around if you have to do it manually with a one-tailed app or function like BINOMDIST(). Fortunately, rarely in usability testing do you ever have a two-tailed test with a Hypothetical State A other than 50%.
But the real lesson from this is that you need to check if your binomial app or function is giving you one or two-tailed probabilities. Given two-tailed has about twice the probability of one-tailed, it makes a big difference. Ideally, the app should give you the choice of one-tailed or two-tailed results since you may be using either in a usability test. If the app doesn’t say what it does, you can test it by comparing results with Excel’s BINOMDIST(), which gives one-tailed probabilities of an observed count or less.
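The Method of Small p-values described above can be sketched in Python so you don’t have to hunt around by hand. This is one plausible reading of the method, not a reference implementation: because a binomial distribution has a single peak, summing every count whose precise probability is no larger than the observed one gives the observed tail plus the matching part of the other tail. The function names and the small tolerance for floating-point ties are choices made for this example.

```python
from math import comb

def binom_pmf(k, n, P):
    """Probability of precisely k events out of n."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

def two_tailed_p(f, n, P):
    """Add up every count that is no more probable than the observed count f."""
    observed = binom_pmf(f, n, P)
    cutoff = observed * (1 + 1e-9)  # tolerance so exact ties are not lost to rounding
    return sum(p for p in (binom_pmf(k, n, P) for k in range(n + 1)) if p <= cutoff)

# With a 50% hypothetical state this is the same as doubling the one-tailed value.
one_tailed = sum(binom_pmf(k, 40, 0.5) for k in range(29, 41))
print(round(one_tailed, 4), round(two_tailed_p(29, 40, 0.5), 4))
```

For the punch card data the two-tailed result matches the doubled one-tailed value, as it should when the distribution is symmetrical.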
Meanwhile, back in your time warp, you’ve found another opportunity to improve the user experience of your hapless keypunching users. You noticed that a thick cable crosses the route to the card reader in the I/O room. Occasionally, users stumble on it, and then some dump their carefully ordered stacks of punch cards on the floor.
You mention this issue to Sven the BOFH, but he’s skeptical. He points out that this is 1978 and all computer users tend towards the geeky side of the population distribution -they’re not exactly known for their physical coordination, so who’s to say they wouldn’t stumble and drop their cards anyway? (You admit you have witnessed that some users are apparently capable of stumbling while standing still waiting for the card reader to read their cards.) Indeed, Sven argues that the big fat black cable is a feature not a bug: it’s such an obvious tripping hazard that it encourages the geeks to walk more carefully reducing the frequency of stumbles from what it otherwise would be. Besides, the only alternative is to route users through a different door to the I/O room, but that’s a somewhat longer walk, which increases the chance for stumbles.
You can forgive Sven for being obtuse about modern usability/human factors. It’s only with today’s UX enlightenment that you never see the virtual equivalent of a big-ass tripping hazard on a web site. Besides, to his credit, the BOFH is open-minded -anything to improve the lot of his users. He’s willing to do an A-B test. You sit outside the I/O room and, by flipping a coin, direct each user to one route or the other to the card reader, and count the number of users that stumble as they pass you. Like before, you exclude users you’ve already observed once to maintain independence of your observations.
It’s important to know if the short way is better than the long way (as Sven theorizes) in addition to the long way being better than the short way (your theory), so this will be a two-tailed test. It’s safe to say that any change that reduces any stumbling and card dumping is worth considering, so that implies a Hypothetical State A of 50% -an equal number of users will stumble whichever way they go. For Hypothetical State B, you and Sven agree that a one-third reduction in stumbling rates is definitely worth favoring one route over the other (e.g., of the users that stumble, 40% stumble on one route, and 60% stumble on the other).
Of the 100 different users you observed, 19 stumbled on the way to the card reader, a 19% rate comparable to the percent of users that have a problem in a usability test of a modern interactive application. But more of interest to you, of those 19 users, 13 stumbled taking the short route over the cable, but only 6 stumbled taking the long route through the alternative entrance. In other words, in your sample, taking the long route reduced stumbling by more than a half. So you want the probability of 6 or fewer users out of 19 stumbling assuming 50% stumble in the population.
p(6 or less out of 19 given 50%) = BINOMDIST(6,19,0.5,TRUE) = 0.0835
This is a two-tailed test, so the p-value for observing something as extreme as 6-or-less out of 19 is 0.0835 * 2 = 0.1671.
There’s a pretty reasonable chance of seeing results like this when the true population rate is 50%. You don’t have to calculate the p-value for Hypothetical State B which stipulates only a one-third reduction. Given your observed one-half reduction, you’re not in the Whatever Zone where it seems like it makes little difference which route your users take. You start planning to run more users.
Correcting for Base Rates
Then the ever-helpful BOFH notices something in your data: of the 100 users you observed, 45 were sent the short way over the cable, and 55 went the long way. There’s no good reason to think there’s something wrong with the coin you were tossing. If you calculate the binomial probability for it, you’ll see it’s quite plausible to get deviations at least that large with a fair coin. But nonetheless you happened to send fewer users over the cable than around it. Sven reasons that if the route really makes no difference, then if you send 55 of all 100 users the long way, then 55%, not 50%, of your 19 stumbling users should have gone the long way.
Sven is right. You’re using the wrong population rate for Hypothetical State A. And you’re not the only one. Your hypothetical rates need to reflect the base or marginal exposure rates. If there is no effect, then the population rate of a condition is equal to the proportion of users exposed to that condition. Failing to correct for the base rates -for the number of users in each condition of an A-B test -can give you p-values that are bigger or smaller than they really are, depending on which condition gets the greater number of users. Failure to take into account the base rate is a common error not only among students of statistics but in less mathematically formal problems in our lives.
It’s easy enough to take into account the base rate with binomial probabilities. In this case, you sent more than half the users on the long route, which means you accentuated the exposure to stumbling on the long route, so your p-value for seeing so few stumbles there is too large. For the number of users stumbling on the long route, Hypothetical State A is 55%, not 50%. Redoing the binomial probability:
p(6 or less out of 19 given 55%) = BINOMDIST(6,19,0.55,TRUE) = 0.0342
The p-value is down on one tail, as expected. But now you’re pissed, because that stupid blogger told you you wouldn’t have to do two-tailed binomial tests in usability for anything other than a 50% population rate, and here it is, only a few paragraphs later and you have to do precisely that.
Trust me here a minute.
If you were to follow the Method of Small p-values, you’d find the opposite extreme of 6-or-less out of 19 is 15-or-more out of 19 (the probability of precisely 15 out of 19 is 0.0203, just under the probability of precisely 6 of 19, which is 0.0233). So the other tail’s probability is:
p(15 or more out of 19 given 55%) = 1 – BINOMDIST(14,19,0.55,TRUE) = 0.0280.
So, the two-tailed p-value for seeing something as extreme as 6-or-less is 0.0342 + 0.0280 = 0.0622. Thank you, BOFH. That may look to you like an acceptable Type I error rate right there, leading you to conclude there would be less stumbling if users always take the long route. However, it’s not good enough for Sven. Most of his users have scientific and mathematical backgrounds. They’re not going to be happy with him blocking the short route unless he can put the stamp of Statistically Significant on the results. Sven wants to see the p-value go below 0.05. So you’re back to running more users, although it’s not as bad as it was before.
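The base-rate correction and the hunt for the other tail can be replayed in Python. This is a sketch under the assumptions in the story (100 users, 55 sent the long way, 6 of the 19 stumblers on the long route); the variable and function names are invented for this example.

```python
from math import comb

def pmf(k, n, P):
    """Probability of precisely k events out of n."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

def or_less(f, n, P):
    return sum(pmf(k, n, P) for k in range(0, f + 1))

def or_more(f, n, P):
    return sum(pmf(k, n, P) for k in range(f, n + 1))

users_long, users_total = 55, 100
stumbles_long, stumbles_total = 6, 19
base_rate = users_long / users_total  # Hypothetical State A becomes 55%, not 50%

uncorrected = or_less(stumbles_long, stumbles_total, 0.5)
lower_tail = or_less(stumbles_long, stumbles_total, base_rate)

# Hunt upward for the first count above the expected value that is no more
# probable than precisely the observed count.
target = pmf(stumbles_long, stumbles_total, base_rate)
other_start = next(
    k for k in range(stumbles_long + 1, stumbles_total + 1)
    if k > stumbles_total * base_rate and pmf(k, stumbles_total, base_rate) <= target
)
upper_tail = or_more(other_start, stumbles_total, base_rate)

print(round(uncorrected, 4), round(lower_tail, 4))
print(other_start, round(upper_tail, 4), round(lower_tail + upper_tail, 4))
```

The search lands on 15 as the start of the other tail, matching the hand calculation above.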
Consistency and Inconsistency
While you were doing all these calculations by hand, Sven whipped up a Cobol BINOMDIST subroutine, and is enjoying printing out his own tables of binomial probabilities. He confirms your 0.0622 two-tailed probability for 6-or-fewer out of 19. Now he decides to look at those who took the short route, of whom 13 out of 19 stumbled, and the base exposure rate was 45%.
p(13 or more out of 19 given 45%) = 1 – BINOMDIST(12,19,0.45,TRUE) = 0.0342
That’s a familiar looking number. And sure enough, using the Method of Small p-values, he finds the other extreme is 4 or fewer:
p(4 or less out of 19 given 45%) = BINOMDIST(4,19,0.45,TRUE) = 0.0280
An exact mirror image of your results, with the same total p-value. Of course. Logically, they should have the same probability since you can’t have 6 out of 19 stumbling on one route without also having 13 out of 19 stumbling on the other route. 6-or-less out of 19 given 55% is synonymous with 13-or-more out of 19 given 45%. The same situation must have the same probability no matter how you state the problem. Math is truth and beauty.
Just for fun, Sven now calculates the probability of observing the users not stumbling when they take the long route. Nineteen of your 100 users stumbled, which means 81 didn’t stumble. You sent 55 users the long way, of whom 6 stumbled, so 49 didn’t stumble. Logically, the probability of 49-or-more out of 81 should be the same as 6-or-fewer out of the 19 since they are simple restatements of the same situation.
Wait a minute. Sven has only done one tail so far, and his p-value is already far larger than what you got for the users that stumbled. Following the Method of Small p-values, Sven finds the other tail starts at 40.
Total p-value = 0.1891 + 0.1827 = 0.3718
Two very different p-values for the same situation. One of them has to be wrong. Do you go with the users who stumbled or those that don’t? Which is right, 0.0622 or 0.3718? Why?
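You can replay Sven’s three restatements side by side with a short Python sketch. It reuses the same reading of the Method of Small p-values as before (sum every count no more probable than the observed one), and the labels and function names are made up for this example.

```python
from math import comb

def pmf(k, n, P):
    """Probability of precisely k events out of n."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

def two_tailed_p(f, n, P):
    """Two-tailed binomial p-value: every count no more probable than f."""
    cutoff = pmf(f, n, P) * (1 + 1e-9)  # tolerance for floating-point ties
    return sum(p for p in (pmf(k, n, P) for k in range(n + 1)) if p <= cutoff)

restatements = {
    "stumbled, long route (6 of 19 at 55%)": (6, 19, 0.55),
    "stumbled, short route (13 of 19 at 45%)": (13, 19, 0.45),
    "did not stumble, long route (49 of 81 at 55%)": (49, 81, 0.55),
}
results = {label: two_tailed_p(*args) for label, args in restatements.items()}
for label, p in results.items():
    print(f"{label}: {p:.4f}")
```

The first two agree because they are mirror images of each other; the third does not, because it conditions on a different subset of the same 100 users.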
Fisher’s Exact Test
The answer of course, is that neither 0.0622 nor 0.3718 is right. The problem with both p-values is that they are using only part of the information you have available. What you need is a procedure that calculates the probability of the entire situation including both those that stumbled and those that didn’t. You want to include all the information in this 2-by-2 cross-tabulation table:
Looking at it this way, you see now that you really have two binary variables in your usability test -two separate sets of categories to classify your users. One is the route you assigned each user (long or short route) and the other is the response of the user (stumble or not). The binomial test is inadequate for this situation because it only has one binary variable. As is often the case when you use less than all the information available, you lose statistical power or “decision potential.” The p-values both you and Sven calculated are larger than they would be if you used the information from both binary variables at once.
This appears to be a common error in A-B testing. I can only guess they get away with it because the error in the p-value is small when the total number of conversions is much smaller than the total number of non-conversions, which is pretty typical. But why be even a little wrong? Use your computer to compute. Use all the information you have.
The procedure to calculate a p-value for the joint counts from two binary variables is Fisher’s Exact Test (not to be confused with Fisher’s F or Fisher’s Z; that guy Fisher did a lot of great stuff; probably had a lot of groupies too). While the binomial test calculates the probability of observing two mutually exclusive counts (failures and non-failures), Fisher’s Exact Test calculates the probability of four mutually exclusive counts from cross-tabulating two binary variables with each other. Like the binomial test, you first calculate the probability of precisely the observed configuration of counts, and then you add to that all other more extreme configurations for either one-tail or two-tails, depending on your intentions with the usability test.
Unlike a binomial test, however, Fisher’s Exact tests for only one hypothetical state: that the proportions of each cell are equal to the base rate. To put it another way, the hypothetical state is that there is no correlation between the two variables that characterize your users, for example that the tendency to stumble is unrelated to the route a user takes. To put it yet another way, the hypothetical state is that the two binary variables are independent. By “independent” I mean the same thing we mean when we talk about users performing independently. When users are independent, the chance of one user failing or succeeding is unrelated to whether another fails or succeeds. When binary variables are independent, then the chance of a user falling in a particular category of one variable is unrelated to his or her category on the other. The chance of stumbling is not different whatever route the user takes, for example.
A state of uncorrelated or independent variables is usually what you want for Hypothetical State A. It’s what we meant to have for this case of alternative routes to the card reader. However, since that’s the only hypothetical state that Fisher’s Exact tests, you cannot test Hypothetical State B with Fisher’s Exact. At least, I haven’t figured out a way yet. Like I said, statistical analysis of counts is surprisingly complicated.
Calculating Fisher’s Exact
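To calculate Fisher’s Exact, first figure your row and column totals.
Stumbled: 6 long route + 13 short route = 19
Did not stumble: 49 long route + 32 short route = 81
Column totals: 55 long route + 45 short route = 100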
Now consider the users you sent on the long route. Of the 100 users, how many different ways are there to divide them up so that there are 55 going the long way and 45 going the short way? To answer that we use the same formula for combinations that we used to calculate binomial probabilities: number of combinations = n! / (f! * (n – f)!). From now on, I’ll abbreviate that formula as COMBIN(n, f), which also happens to be the Excel function for calculating combinations. You use either the 45 or 55 for f since it means the same thing, and gives the same answer. That is, if there is x number of ways to get 45 out of 100 on the short route, there must also be the same x number of ways to get 55 out of 100 on the long route since every way of getting 45 on the short route is also a way to get 55 on the long route.
So, the answer to the question is:
COMBIN(100,55) = 6.1448e28
Now of 19 total users that stumbled, how many different ways are there to split them into 6 of them taking the long route and 13 of them taking the short route?
COMBIN(19,6) = 27,132 (again, it doesn’t matter if I use 6 or 13, but I’ll use 6 to be consistent)
And of the 81 that didn’t stumble, how many ways are there to get 49 of them in the Long Route cell?
COMBIN(81,49) = 3.6219e22 (lots smaller than the first number, but still pretty impressive)
For every one of those 27,132 ways of filling the stumbler row there are 3.6219e22 ways to fill the non-stumbler row, so you can get the total number of different ways of filling both rows to arrive at the counts we’ve observed by multiplying the two ways together:
27,132 * 3.6219e22 = 9.8269e26
Divide that by the number of ways of getting 45 and 55 total users on the short and long way, and that’s the proportion of ways to get the observed counts in each row given the totals of each column. It’s therefore the probability of getting the observed counts.
p = 9.8269e26 / 6.1448e28 = 0.01599
That’s the probability of the precise configuration of counts you observed.
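The counting argument above is easy to check by machine. This is a minimal Python sketch that uses `math.comb` in place of Excel's COMBIN; the variable names are only illustrative.

```python
from math import comb

ways_columns = comb(100, 55)        # ways to split 100 users into 55 long / 45 short
ways_stumblers = comb(19, 6)        # ways to put 6 of the 19 stumblers on the long route
ways_non_stumblers = comb(81, 49)   # ways to put 49 of the 81 non-stumblers on the long route

# Proportion of all column splits that produce exactly the observed table.
p_observed = ways_stumblers * ways_non_stumblers / ways_columns

print(f"{ways_stumblers:,} * {ways_non_stumblers:.4e} / {ways_columns:.4e} = {p_observed:.5f}")
```

Python integers have no size limit, so the huge intermediate counts are handled exactly and only the final division produces a floating-point number.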
As a general formula, the probability of a particular two-by-two crosstabulation table of counts is:
p = COMBIN(tr1,f1) * COMBIN(tr2,f2) / COMBIN(t,tc)
- tr1 is the total for the first row.
- tr2 is the total for the second row.
- f1 is the count from the first row in one of the columns.
- f2 is the count from the second row in the same column.
- tc is the total for that column (f1 + f2).
- t is the total sample size.
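That only gets you the p-value of precisely the observed configuration of observations. Now, similar to what you do with a binomial test, you need to calculate the p-value for each way of having a stronger relationship between stumbling and route while keeping the column and row totals the same. For example:
5 stumblers on the long route and 14 on the short route (leaving 50 and 31 non-stumblers)
4 stumblers on the long route and 15 on the short route (leaving 51 and 30 non-stumblers)
And so on.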
If you’re doing a two-tailed test, then you also have to use the Method of Small p-values to find the corresponding configurations in the opposite tail, where there is proportionally more stumbling on the long route than the short route, and add in their p-values too.
For our stumbling data, that all sums up to a p-value of 0.0386. How do you like that? All this time your results met Sven’s requirement for statistical significance. It was just a matter of finding the right test.
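For readers who would rather see the whole procedure in one place, here is a minimal Python sketch of Fisher’s Exact for a 2-by-2 table. The function name `fisher_exact_2x2` is hypothetical, and the two-tailed rule it uses (add up every table that is no more probable than the observed one) is my assumption about what the Method of Small p-values amounts to here; it does reproduce the 0.0386 quoted above.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (one_tailed, two_tailed) p-values for the table
           a  b
           c  d
    with every row and column total held fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability of the table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)    # smallest top-left count the totals allow
    highest = min(row1, col1)       # largest top-left count the totals allow
    p_obs = prob(a)

    lower = sum(prob(x) for x in range(lowest, a + 1))
    upper = sum(prob(x) for x in range(a, highest + 1))
    one_tailed = min(lower, upper)

    # Two tails: every table at least as improbable as the observed one.
    # The tiny tolerance guards against floating-point ties.
    two_tailed = sum(prob(x) for x in range(lowest, highest + 1)
                     if prob(x) <= p_obs * (1 + 1e-9))
    return one_tailed, two_tailed

# Rows: stumbled / did not stumble. Columns: long route / short route.
one_tailed, two_tailed = fisher_exact_2x2(6, 13, 49, 32)
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

Other software may define the two-tailed p-value slightly differently, so small disagreements between tools are not necessarily errors.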
I went through all that math above just so you know what Fisher’s Exact is up to under the hood. In practice you use an on-line or mobile app. I can recommend the one at GraphPad Software and one by Øyvind Langsrud. Unfortunately, Excel doesn’t have a Fisher’s Exact function.
Chi-square Tests of Independence
Just like you can use the normal approximation of the binomial distribution to estimate the p-value from a binomial test, you can also use the normal approximation to estimate the p-value for the relationship between two categorical variables. That approximation is the chi-square (not “chi-squared”) test. It uses the chi-square distribution, which is sort of a normal distribution for multiple variables combined.
In general a chi-square test is useful for testing how well some observed values fit expectations. To use chi-square to test the independence of two categorical variables, we calculate the counts we would expect to get in each cell if the hypothetical state of independence were true. For example, overall 19/100 or 19% of your users stumbled on the way to the card reader. If stumbling is independent of route, then you would expect that of the 55 users that took the long route 19% or 10.45 would stumble (on average).
Expected count of stumbling users on long route = 55 * 19/100 = 10.45
Likewise, the expected number of users stumbling on the short route is the total proportion of users stumbling times the number who took the short route:
Expected count of stumbling users on the short route = 45 * 19/100 = 8.55
Now do the same calculations for those not stumbling. The expected number of users not stumbling on the long route is the total proportion of users not stumbling times the number of users on the long route:
Expected count of non-stumbling users on long route = 55 * 81/100 = 44.55
And likewise for the short route:
Expected count of non-stumbling users on the short route = 45 * 81/100 = 36.45
So you see a pattern here. For any cross-tabulation table of two categorical variables, if the variables are independent (uncorrelated), then the expected count fe for a cell in the rth row and cth column is:
fe for cells in row r and column c = tc * tr / t
Where tc is the total for column c, tr is the total for row r, and t is the grand total (your sample size).
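The expected-count rule is a one-liner once the totals are in hand. A minimal Python sketch, with the table laid out as rows of stumbled / did not stumble and columns of long / short route (the layout and variable names are my own choice):

```python
observed = [[6, 13],     # stumbled:        long route, short route
            [49, 32]]    # did not stumble: long route, short route

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# fe = tc * tr / t for every cell
expected = [[tr * tc / total for tc in col_totals] for tr in row_totals]

for row in expected:
    print([round(fe, 2) for fe in row])
```

Note that the expected counts have the same row and column totals as the observed counts, which is a handy sanity check on a spreadsheet version too.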
To get the chi-square statistic, calculate the following for each cell:
2 * f * LN( f / fe )
Where f is the count you observed in the cell and LN() is the natural logarithm function. This form of calculating chi-square is known as the “G-test” to distinguish it from an older somewhat less accurate calculation. While the G-test is more accurate than the alternative, it is still using a normal approximation. For our data we get for each cell:
2 * 6 * LN(6 / 10.45) = -6.66
2 * 13 * LN(13 / 8.55) = 10.89
2 * 49 * LN(49 / 44.55) = 9.33
2 * 32 * LN(32 / 36.45) = -8.33
The older alternative calculation is suitable when you have no spreadsheet software or scientific calculator. It’s:
(f – fe)^2 / fe
(6 – 10.45)^2 / 10.45 = 1.89
(13 – 8.55)^2 /8.55 = 2.32
(49 – 44.55)^2 / 44.55 = 0.44
(32 – 36.45)^2 / 36.45 = 0.54
The chi-square statistic is the sum of these numbers:
Chi-square = -6.66 + 10.89 + 9.33 + -8.33 = 5.23
Or, doing it the old-fashioned way:
Chi-square = 1.89 + 2.32 + 0.44 + 0.54 = 5.20
Doesn’t make much difference, especially since it’s all an approximation anyway.
The p-value of a chi-square can be found with the CHIDIST() function in Excel. The degrees of freedom for a cross-tabulation of two binary variables is 1.
p = CHIDIST(5.23,1) = 0.0222.
p = CHIDIST(5.20,1) = 0.0226.
The chi-square test is always two-tailed, so that’s your final p-value. As you can see, it’s only approximately equal to the true p-value of 0.0386 that we got with Fisher’s Exact. Like the normal approximation of the binomial distribution, the larger the sample size and the more evenly the counts are distributed (i.e., the more the row and column totals are close to each other), the more accurate it is. In the case of our stumbling data, the sample size is pretty big, but the row totals are very different from each other, which accounts for the inaccuracy. A small count in one or more cells of the cross-tabulation table, like the 6 stumblers on the long route, is a warning sign to avoid chi-square.
Also like the normal approximation of the binomial, there’s no particularly good reason to use the chi-square test these days when you can use Fisher’s Exact. The main reasons I’m covering Chi-square are the same reasons I covered the normal approximation of the binomial:
- It’s handy if for some reason you don’t have access to an app for Fisher’s Exact. The formula is sufficiently simple that you can whip it up in a spreadsheet pretty quickly, which is much easier than doing Fisher’s Exact in a spreadsheet.
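Both versions of the statistic and their p-values can be computed in a few lines. This is a minimal Python sketch; `chi2_p_df1` is a hypothetical stand-in for CHIDIST(x, 1), and the `erfc` shortcut it relies on is valid only for 1 degree of freedom.

```python
from math import erfc, log, sqrt

observed = [6, 13, 49, 32]
expected = [10.45, 8.55, 44.55, 36.45]

# G-test form: sum of 2 * f * LN(f / fe)
g_stat = sum(2 * f * log(f / fe) for f, fe in zip(observed, expected))

# Older form: sum of (f - fe)^2 / fe
pearson_stat = sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))

def chi2_p_df1(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

p_g = chi2_p_df1(g_stat)
p_pearson = chi2_p_df1(pearson_stat)

print(f"G       = {g_stat:.2f}, p = {p_g:.4f}")
print(f"Pearson = {pearson_stat:.2f}, p = {p_pearson:.4f}")
```

The individual G-test terms can be negative, as in the worked figures, but their sum cannot.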
- Some on-line apps for calculating the p-value for the relation between two binary variables use chi-square rather than Fisher’s Exact, so you need to be aware when you’re only getting an estimate.
There is one other reason to know about the chi-square test, and that is it’s easily scalable beyond binary variables. For instance, you can use it with three or more different designs (e.g., three different ways of warning users about the cable), or three or more different user response categories (e.g., stumble, not stumble, and turn around and not go to the card reader). You merely add rows and columns of count data to your cross-tabulation table. You still calculate fe the same way for each cell, and still sum up [2 * f * LN( f / fe )] for all your cells to get your chi-square statistic. You’ll just be doing it to more than four cells.
The only change is to your degrees of freedom. For any cross-tabulation table, the degrees of freedom for a chi-square test is:
df = (r – 1) * (c – 1)
Where r is the number of rows and c is the number of columns in your table. For two binary variables:
df = (2 – 1) * (2 – 1) = 1
For a usability test with 4 different designs and 3 different user response categories:
df = (4 – 1) * (3 – 1) = 6
The chi-square test is sufficiently simple you can use it for such elaborate usability tests. Fisher’s Exact, in contrast, gets very complicated quickly when you scale it to more than binary variables. You wouldn’t want to do it by hand, even in a spreadsheet.
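To show how little changes when the table grows, here is a minimal Python sketch of the G statistic and degrees of freedom for a table of any size. The function name `g_test` is hypothetical, and skipping empty cells is my assumption (an observed count of zero contributes nothing to the sum). Turning the statistic into a p-value for more than 1 degree of freedom still needs CHIDIST or an equivalent.

```python
from math import log

def g_test(table):
    """Return (G statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    g = 0.0
    for r, row in enumerate(table):
        for c, f in enumerate(row):
            fe = row_totals[r] * col_totals[c] / total   # expected count under independence
            if f > 0:                                     # an empty cell adds nothing
                g += 2 * f * log(f / fe)

    df = (len(table) - 1) * (len(table[0]) - 1)
    return g, df

# The stumbling data again; a table with more rows or columns goes through the same code.
g, df = g_test([[6, 13], [49, 32]])
print(f"G = {g:.2f} with {df} degree(s) of freedom")
```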
A single-table Chi-square analysis is the appropriate way to do so-called multivariate testing, rather than breaking the test down into a series of 2-by-2 A-B tests. Consider: if your level of comfort with a Type I Error is 0.05, and you run 6 separate A-B tests, what’s the chance of at least one leading to a design decision when there really is no difference in conversion rates among any of them? Well, that’s just another binomial probability, sort of a meta-binomial probability, but it’s the same calculation:
p(1 or more, given a 5% rate) = 1 – BINOMDIST(0,6,0.05,TRUE) = 0.265
That is, your actual chance of making a Type I Error is better than 1 in 4, more than five times greater than if you did it all as a 2-by-6 Chi-square. If the p-value for the 2-by-6 Chi-square is low enough for you, indicating some kind of correlation between designs and responses, then you can do a series of 2-by-2 tests (using Fisher’s Exact) in order to check which design is apparently better than the other.
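The inflation is simple to compute for any number of tests. A minimal Python sketch, assuming (as the binomial calculation above does) that the tests are independent of each other:

```python
alpha = 0.05    # acceptable Type I Error rate for a single test
n_tests = 6     # number of separate A-B tests

# Chance that at least one of the tests is a false alarm.
familywise_error = 1 - (1 - alpha) ** n_tests
inflation = familywise_error / alpha

print(f"chance of at least one false alarm: {familywise_error:.3f}")
print(f"that is {inflation:.1f} times the single-test rate")
```

Changing `n_tests` shows how quickly the problem grows as more designs are compared one pair at a time.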
Be Careful Out There
You might be wondering if all this is just academic. Binomials, chi-square, Fisher’s Exact, they all pretty much say the same thing. Let’s return to the results at DIY Themes, where the Visual Website Optimizer calculator provided a p-value of 0.051 for getting 8 conversions out of 793 on one design versus 15 out of 736 on the other. That calculator has three problems:
- It performs a one-tailed test. That’s wrong because the very fact that people are debating the counter-intuitive finding tells me everyone has an interest in either design out-performing the other.
- It uses the normal approximation, rather than exact probabilities. That’s very suspect with some cells having such low counts.
- It estimates the binomial probability for just the number of conversions, ignoring the information supplied by users that did not convert.
Plug the same numbers into a Fisher’s Exact two-tailed test and you discover the true p-value is 0.14, almost three times greater than what Visual Website Optimizer reported. There’s a one in seven chance of seeing results like this when there is actually no difference between the designs. Personally, that is way too much of a chance of error for me to waste time trying to understand why this form of social proof allegedly backfired. Call me when there’s more data.
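If you want to check that figure yourself, here is a minimal Python sketch of the two-tailed Fisher’s Exact calculation applied to the DIY Themes counts. The variable names are hypothetical, and the two-tailed rule (sum every split of the conversions that is no more probable than the observed one) is the same assumption about the Method of Small p-values used for the stumbling data.

```python
from math import comb

conv_a, visitors_a = 8, 793     # one design
conv_b, visitors_b = 15, 736    # the other design

total_conv = conv_a + conv_b
total_visitors = visitors_a + visitors_b

def prob(x):
    """Probability that x of the total conversions fall on the first design, totals fixed."""
    return (comb(visitors_a, x) * comb(visitors_b, total_conv - x)
            / comb(total_visitors, total_conv))

p_observed = prob(conv_a)

# Two tails: every split at least as improbable as the observed one.
two_tailed = sum(prob(x) for x in range(total_conv + 1)
                 if prob(x) <= p_observed * (1 + 1e-9))

print(f"two-tailed p = {two_tailed:.2f}")
```

The non-conversions enter through the visitor totals, so nothing in the data is thrown away.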
Problem: Statistical analysis of count data.
- Use the binomial probability distribution for independent counts in two categories.
- For preference counts of a binary choice, set Hypothetical State A to 50%.
- Select the appropriate number of tails.
- Check the number of tails your binomial app or function uses.
- Avoid using the normal approximation.
- Use Fisher’s Exact test for A-B test results.
- Select the appropriate number of tails and be sure your app supports it.
- Avoid tests that ignore the base rate.
- Avoid tests that ignore the number of non-conversions.
- Avoid tests that use a normal approximation for a simple 2-by-2 cross-tabulation. These include: Chi-square and the G test.
- Use Chi-square tests of independence (preferably the G-test) when testing more than two designs.
- Test all designs simultaneously in a single large cross-tabulation table.
- If the single large test indicates a relationship, then do 2-by-2 tests to compare designs with each other.
Here’s a spreadsheet of this post’s data and analyses.