What every user experience professional needs to know about statistics and usability tests.
Do you like computers, but hate math? Would you love to work on creating cutting-edge technology, but don’t think you have the quantitative aptitude to be a programmer or electrical engineer? Then become a user experience professional! If you can count to 5 (the number of users in a usability test), then you already know all the math you’ll need! Everything else is art! I bet you’re good at art, aren’t you?
Ha! Sucker! It turns out you do need to know some math to work in user experience. Being in UX means that sooner or later you’re going to have to deal with data on user performance or satisfaction, typically from a usability test. Even if you restrict yourself to design and leave the user research to others, you’re going to have to review the results of user research to inform your design work, so you’re going to need some concepts for evaluating that data.
Specifically, you need to know a thing or two about inferential statistics, the branch of statistics that helps you determine what you can reasonably conclude about your population of users based on what you’re seeing in your sample of users. Inferential statistics serve two basic purposes, both relevant to user data:
- Value Inference. Given an observed percent or average performance in your sample of users, infer what would likely be happening in the whole population of users. You know intuitively that the average or percent of user performance you see in a small sample of users could easily differ from what happens with users in general, just because you happened, at random, to get some users who performed better or worse than your average users. Value inference tells you how far off that is likely to be.
- Relationship Inference. Given an apparent connection between two variables in your sample of users, infer if there is some sort of systematic or causal connection between the two. You might see a certain pattern in your sample, like that users who start on one page were more likely to make a purchase than those who started on another page. You know intuitively that it might be a coincidence -maybe you just happened to get more purchase-ready users on one page than on the other by chance. Alternatively, there might be some sort of underlying relationship. Maybe one page makes purchasing more attractive than the other, or (if users got to choose their starting page), maybe purchase-ready users are attracted more to one page than another. Relationship inference tells us the likelihood of such systematic relationships existing.
Why Avoid Inferential Statistics?
“I Don’t Trust Numbers”
Whether you’re making a value inference or relationship inference, inferential statistics are essential for establishing the confidence you should have in your data. Some may think it’s not worth bothering with statistics when overall test performance is only one of many things weighing in on design. But without establishing a level of confidence for that performance, how can you know how to weigh it in your design deliberations? How do you know whether you’re really learning something from a usability test versus just fooling yourself? Without establishing a level of confidence, how can you persuade your client on the importance of the problem your design is trying to solve? What do you do if a key stakeholder, say Wayne from Finance, says, “Bah, you only tested five users. That’s not statistically significant.” Just shrug?
O, Mr. Statistical Significance. Feared by some. Treated with contempt by others. It seems everyone would just wish he’d go away. I think most UX practitioners expect Siggy to say something about whether the sample sizes in our user research are big enough or not. This gives UXers the uneasy feeling that if they let him, Siggy will rip away the sham that is our small-sample usability tests that we’re so invested in. Like Galileo showing the Earth orbits the sun, he could prove faith in usability testing is misplaced. We Can’t Let Him Do That. However, like the misunderstood loner in a B-grade film, Siggy actually has something valuable to offer the community of UXers if they would take the time to get through his rough unfamiliar mathematical constitution.
Instead, I hear that statistics are not worth the effort in a typical small-sample usability test, or even that statistics don’t apply to small samples, as if being simple and loose excludes one from the laws of probability. I’m told things like, “Our sample size is too small for statistical analysis,” which is sort of like saying, “Our boat is too small to worry about overloading it.” Some seem to think it works to preemptively shoot the messenger before he delivers any inconvenient news. They say things like, “I know it’s not going to be significant with only five people, so why bother?” and off they go pretending the issue doesn’t exist. Right. I know my parachute isn’t packed properly, so why check it? Excuse me, I’ve got skydiving to do.
“I’m Being Qualitative”
It’s legitimate to argue that statistics are not relevant to small-sample usability tests because you’re depending primarily on qualitative data and analysis, rather than quantitative data and analysis. Qualitative analysis is an effective and perfectly scientific approach to answering the kinds of questions you want to answer with a usability test, especially when used as part of iterative design (i.e., formative rather than summative testing). In that case, you’re less interested in value inference than relationship inference. You don’t really care how many users have problems with your product. As an expert user experience professional, you assume that there will be problems with your product, and with frightful regularity, you’re proven right, despite your best efforts as an expert user experience professional.
What you want to know is what those problems are and, most importantly, why. Without an ability to infer a relationship between the problem (e.g., the user getting lost) and the cause (some confusing link labels), you won’t know how to improve the product. To quantitatively infer relationships you have to anticipate both the problem and cause, and measure or control each, doing something closer to A-B testing than classical small-sample usability testing. Because of this, qualitative is superior to quantitative when you’re doing exploratory “let’s just stick a user in front of this thing and see what happens” kind of usability testing.
So you should most definitely collect and evaluate qualitative data. However, this should be done because it is the best methodology, not as a ploy to ignore statistical issues. In particular:
- Just calling it “qualitative” doesn’t make it qualitative. Some UXers seem to think that qualitative means simply being vague with your amounts, that as long as you don’t represent anything with Arabic numerals, you can skip the statistics parts. They phrase results using words like “some users” and “most users,” or show the amounts graphically, rather than giving a precise count or percent. Anytime you are dealing with relative amounts or levels of anything, you are dealing with quantities, and statistics are necessary. Anytime you graph amounts, you are using statistics, but not necessarily inferential statistics. True qualitative data are stories, sequences of linked events and perceptions, things that cannot be reduced to amounts without losing key information you need to make qualitative relationship inferences. Quantifying data necessarily means abstracting it. You cannot change quantitative data into useful qualitative data -you won’t have the rich narrative information you need for qualitative analysis.
- Being qualitative is not generally cheaper or easier than quantitative. While qualitative research is typically done on small sample sizes, the time and effort spent on each user is typically much higher than quantitative research. In quantitative research, data collection for one user may be limited to checking a box to record if a user completed a task or not. In analysis, that datum is processed in a few CPU cycles to be part of your results. In qualitative research, to get the rich information you need, you need to record much more about the user’s performance, where they pause, what they almost clicked on, what they said while thinking aloud, and how they answered your open-ended questions. In analysis, you have to pull all that together into a coherent narrative, which takes a whole lot of brain cycles. Qualitative research typically has small sample sizes not in order to be cheaper than quantitative research but because you wouldn’t be able to afford a large-sample qualitative study. While in some situations qualitative research is more cost-effective than quantitative, and vice versa, on average it’s a wash. Going qualitative by default is not going to save you time or money.
- Sometimes you do anticipate the problems and causes. While formative research and iterative design should always include opportunities for serendipitous findings, you make best use of your usability resources if you focus your research on specific design questions. While this doesn’t preclude using qualitative research to address the question, it at least implies you have the ability to conduct a quantitative study: you’ve hypothesized how a specific design will specifically affect user performance. You can vary the design on those specifics, and measure the specific result. When you have such specificity, a quantitative approach will typically give you a clearer answer to your question than a qualitative approach.
- Sometimes you need to make value inferences. Qualitative research can tell you about the relationship between two things, but it is not suited for telling you frequency or amounts of the things. However, sometimes that’s what you need to know. Sometimes it’s not enough to know if you have a problem with a product. You also want to know how serious the problem is, for example, how many of your users you expect it to affect. With limited usability resources, you may need to know if it’s worth justifying the expense of fixing the problem, or you may need to prioritize problems for attention, or you may have conflicting results from two different sources and need to weigh them against each other. Value inferences are almost always best done with inferential statistics. This post focuses on making value inferences from small-sample usability test results.
- Quantitative and qualitative methods are not mutually exclusive. You can use them both, sometimes on the same sample of users, to exploit the strengths of both. You can include open-ended questions in a quantitative study, which may provide insight into a specific individual’s (e.g., an outlier) quantitative performance. And you can conduct inferential statistical analysis on key quantitative variables in a small-sample usability test that is otherwise qualitative.
The Significance of Significance
Let’s pretend I’ve convinced you that it’s vital to apply inferential statistics to user experience problems. So what is statistical significance and why do you need to pay attention to it? The answers are that it’s just a social convention and you can ignore it.
Alrighty. It’s going to take some explanation on how I got to those answers.
Technically, “statistical significance” means the results you’re seeing in your sample of users cannot be plausibly attributed to random sampling. For example, let’s say you conduct a usability test with a paltry sample of three users, and two out of three have a problem with completing the task. A 67% failure rate sounds pretty bad, but could it plausibly be attributed to random sampling error? In other words, despite your efforts to get representative users for your test, you could have just happened to get a couple of weirdoes who can’t figure out a UI that would be perfectly clear to just about every other life form on the planet. With a sample of only three, we intuitively sense that this is pretty plausible. We can expect that Wayne from Finance thinks it’s very plausible, since he wouldn’t even accept results from five users.
However, this fails to take into account that in all probability Wayne doesn’t know his alpha from a hole in the ground. This is not something to judge subjectively. You can calculate the plausibility (probability) of two out of three failing a task if you make one reasonable assumption and you have one additional piece of information.
The one reasonable assumption is that your sample is a random sample of the population of users that you’re interested in. Making this assumption allows the laws of probability to kick in, making the calculation possible. Now, of course, you didn’t take a strictly random sample of users, where every user in the population has an equal chance of being in your sample. You probably drew from a specific geographic location, and there were limits on your ability to persuade users to take part in your usability test. You are certainly drawing from a specific period in time. Your product will be used by users in the years ahead, but did you think to sample from the future? Aha! No random sample then!
However, you did attempt to acquire a representative sample of users, users that are known to have similar characteristics to the users in your population -similar knowledge, skills, and maybe even demographic variables. For statistical calculation purposes, a representative sample is at least as good as a random sample. If you’re thinking about how your psych methods professor drilled into your head the importance of randomness, you’re probably thinking of random assignment to experimental conditions, not random sampling for inclusion in the experiment in the first place. Random assignment is quite different and infinitely more feasible than random sampling. Some might even argue that a representative sample is better than a random sample, that the results you see are more likely to be consistent with the population’s performance than just picking users by drawing names out of a physical or virtual hat. However, if you think about it, your actual means of selecting representative users probably depends on pretty crude criteria. For example, you might choose users with a certain range of years of computer and web experience, but that’s a pretty indirect measure of their mental models and actual understanding of computers and the web. By all means, try to get representative spreads of users, but most likely it will be about as good as a true random sample. Which is good enough for our purposes.
(Parenthetically, you also need to assume independent performance among your users, where the chance of one user failing doesn’t affect the chance of another failing. However, that is generally not an issue in usability testing as long as:
- You tested your users separately rather than as a team or focus group so that they don’t influence each other.
- You have one number per user per variable. For example, you shouldn’t have each user try the task twice in order to get a sample size of six rather than three. You can, however, combine performance on both attempts for each user (e.g., record 0, 1, or 2 failures per user) in order to use all data but retain a sample size of three.)
The Additional Information
Okay, we’ll buy the assumption of a random sample, so let’s calculate the probability of two out of three failing. To which you should immediately ask, the probability given what? In order to calculate a probability, you need some sort of anchor point, some sort of number to get traction on. For example, to calculate the chance of getting three heads in a row when tossing a coin, you need to have the chance of getting a head on one toss.
In the case of our usability test, we can calculate the probability of two out of three failures given a chance of one user in the population failing. Which, if we knew, we wouldn’t have to do the stupid quantitative analysis in the first place. But wait: it’s useful to use a hypothetical state of the population -a “for the sake of argument” chance of failing. And what’s the argument in this case? What we’re really interested in is whether this two-out-of-three failure rate means there’ll be an unacceptable failure rate when the product is released. In other words, we want to make a value inference.
Your Hypothetical Population State(s)
To calculate the probability of two out of three failures you need an actual number for the hypothetical population state. Should it be that 50% of the users in the population fail? Or 25%? 10%? 0%? You can work with your stakeholders to decide what constitutes an acceptable failure rate to use as a hypothetical population state. One way to look at it is to compare the cost of fixing the design with the cost of not fixing the design. Maybe Wayne can even give you some numbers to figure this out. For example, suppose redoing the average design flaw takes a total of 10 worker-hours, working at $100 per hour, so the cost of a fix is $1000.
The cost of not fixing the design may be harder to quantify in dollars and cents. In business software, a design flaw might make the task take longer (costing more user time), reduce worker morale (maybe notch-up turnover), increase calls to technical support (requiring more staffing there), increase the chance of an untrapped costly error, and so on. For a consumer web application, a design flaw may annoy users, diminishing the value of the brand, ultimately resulting in lost sales or a reduced user base to sell advertising to.
Let’s say it’s a business app, and on average a single failure takes users 15 seconds to recover (a failure in the mild annoyance category). If users are paid $30 per hour, then each failure costs 12.5 cents in time alone. So if there are over 8,000 failures in the lifetime of the design, then it’s worth fixing the design (8,000 * $0.125 = $1000, the cost of fixing) just to save the workers’ time. Let’s say you have 100 users encountering the design once per day on average with 250 workdays per year, and the lifetime of the design is 2 years. That works out to 50,000 encounters over the app’s lifetime (100 * 1 * 250 * 2). Thus, the break-even failure rate is 16% per encounter (8,000/50,000). If the population failure rate is over 16%, you save money in the long run by fixing the design. If the population failure rate is less than 16%, then you save money by not fixing the design (taking into account only the costs of worker time).
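If you want to play with these numbers yourself, here is a minimal Python sketch of the break-even arithmetic above. The function name and parameter names are my own invention for illustration, and the only cost counted is worker time, as in the example.

```python
# Break-even failure rate for fixing a design flaw, counting worker time only.
# All names are hypothetical; the figures come from the worked example above.

def break_even_failure_rate(fix_cost, hourly_wage, recovery_seconds,
                            users, encounters_per_day, workdays_per_year,
                            lifetime_years):
    # Cost of one failure: the wages paid while the user recovers.
    cost_per_failure = hourly_wage * recovery_seconds / 3600
    # Number of failures at which lost time equals the cost of the fix.
    break_even_failures = fix_cost / cost_per_failure
    # Total times the design is encountered over its lifetime.
    total_encounters = (users * encounters_per_day
                        * workdays_per_year * lifetime_years)
    return break_even_failures / total_encounters

rate = break_even_failure_rate(
    fix_cost=10 * 100,        # 10 worker-hours at $100 per hour
    hourly_wage=30,           # users paid $30 per hour
    recovery_seconds=15,      # each failure costs 15 seconds
    users=100,
    encounters_per_day=1,
    workdays_per_year=250,
    lifetime_years=2,
)
print(f"Break-even failure rate: {rate:.0%}")  # 16%
```

Change any one input (say, 10,000 users instead of 100) and you can see how quickly the break-even rate drops.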
The total cost of not fixing a flaw depends on the number of users and the number of times the users encounter the design over the lifetime of the product. Do the same kind of calculations on an e-commerce app with millions of users and hundreds of millions of dollars in sales, and you’ll see it’s worth it to fix a design flaw when only a fraction of the users encounter the flaw, of whom only a tiny fraction fail, of whom only a tiny fraction end up switching to a competitor.
In practice, you’re not going to be able to determine the break-even population rate with such precision. What you can do is come up with a range of population rates by playing with the numbers you have and making some educated judgments with your stakeholders. Figure the following end points of the range by answering the following two questions:
- Hypothetical State A: What failure rate would you regard as absolutely acceptable? Here you want a rate that everyone agrees is definitely in the “Who cares?” level.
- Hypothetical State B: What failure rate do you regard as absolutely requiring a fix? Here you want a rate that everyone agrees is clearly worth the cost of fixing.
Obviously Hypothetical State B is greater than Hypothetical State A. The true break-even failure rate would fall in between State A and B, a range we might call the “Whatever” zone.
The Answer is…
So we’re trying to calculate the probability of getting 2 out of 3 failures given Hypothetical State A and B. Let’s say that Wayne says both have got to be 0%; not 1% or 0.1%, but 0%. Failure is not an option. Ol’ Wayne is really a softy. He cares deeply about every user failure and insists on no failures ever. Any failure is definitely worth fixing. What’s the probability of two out of three users in your sample failing given the chance of failure in the population is 0%? (Show your work).
Answer: 0. You just saw two users fail, so you know it cannot be 0%. I mean, maybe you got the only two life forms out of millions of users that would fail, but that still means the failure rate cannot be 0%. That was easy. This is something that those who worry about small sample sizes seem to forget: irrespective of the sample size, if you ever observe users failing then there certainly are some users who fail with your product. It’s only a question of how many.
Let’s make it so I actually have to do some calculations. A 0% failure rate is clearly too ambitious. On something like a web site for the general public, you can tolerate a little failure. We’d probably feel pretty happy with a 1% or 2% failure rate. We can probably live with 1% to 2% of our users doing a workaround, asking a colleague for help, calling support, reading the documentation (gasp), or, in the worst case, taking their business somewhere else. Maybe even 5% to 10% would be acceptable as long as we’re talking minor annoyances and recoverable problems, not total user meltdowns. In contrast, if you’re getting a 50% failure rate, even if they’re just minor annoyances, then your web site is starting to look like a usability disaster area. For this example let’s go with:
- Hypothetical State A: 10% failure. Definitely don’t fix the design.
- Hypothetical State B: 50% failure. Definitely fix the design.
That puts the break-even point for fixing-versus-not-fixing in the 20-40% range. That might be reasonable for some intranet apps with small populations of users.
With regard to statistical significance, we’ll use Hypothetical State A. We’ll revisit Hypothetical State B later.
What’s the probability of two out of three users failing when the population failure rate is 10%?
Crunch, crunch, ptui!
About one out of 36. That’s a pretty low chance. What’s it mean? Well, it’s very implausible that the population failure rate is 10%. We don’t know what it is, but it’s clearly not less than 10% -that would be even more improbable. It’s much more likely to be more than 10%. It means that it’s very likely if you release the web site as is, the failure rate will be above the “who cares?” level. The true rate is in the Whatever zone… or higher, perhaps well above the Definitely Worth Fixing point of 50%. With two out of three or 67% of the users in your sample failing, that would be hardly surprising. What should you do? Care. Show your users some love. Go fix the design. Probably the worst you can do is roughly break even.
To statisticians, the 0.028 number I calculated is symbolized by p, so we call it a “p-value.” A p-value represents the probability of observing a particular result or one more extreme (2 or more failures out of 3) given a hypothetical state of the population (10% failure rate). What we just did is basically what nearly all inferential statistics is about. We calculated a p-value for an observed sample result given a well-chosen hypothetical state of the population (a “null hypothesis” in statistician’s lingo). We then looked at the p-value and decided it was too small to be plausible: it’s unlikely that the failure rate is 10% (or less). It’s likely that it’s over 10%. It’s likely we have something worth worrying about. Note that while a p-value is an infinitely continuous number between 0 and 1, we need to make a binary decision based on it: we’re either going to work to re-design the product or not.
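For the curious, the “crunch, crunch” step can be sketched in a few lines of Python. This is my own illustration, assuming independent users and a binomial model; the function name `p_value_at_least` is made up for this sketch. It adds up the binomial probabilities of seeing the observed number of failures or more.

```python
from math import comb

def p_value_at_least(failures, sample_size, hypothetical_rate):
    """Probability of `failures` or more failing out of `sample_size`
    independent users, if each fails with probability `hypothetical_rate`."""
    return sum(
        comb(sample_size, k)
        * hypothetical_rate ** k
        * (1 - hypothetical_rate) ** (sample_size - k)
        for k in range(failures, sample_size + 1)
    )

# Two (or more) out of three users failing, Hypothetical State A = 10%.
p = p_value_at_least(2, 3, 0.10)
print(round(p, 3))  # 0.028, or about 1 in 36
```

The sum has two terms here: exactly two failures (3 * 0.1 * 0.1 * 0.9 = 0.027) plus exactly three failures (0.1 * 0.1 * 0.1 = 0.001).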
In nearly all cases, you’d probably agree that 0.028 is so low you gotta believe the real failure rate is over 10%. But what if the p-value were 0.10? Or 0.20? Or 0.50? Somewhere you have to draw the line and set a criterion. You can think of the p-value as the doubt about the hypothetical population state with respect to your observed results. A small p-value means high doubt about the hypothetical population state, and more confidence that the actual population state is more in the direction of your sample result (in this case, the actual failure rate is over 10%). A big p-value means your sample results give you little reason to doubt the hypothetical population state. Maybe you have some other reason to doubt the hypothetical population state, but it’s not because of the quantitative results of your usability test.
Another way of thinking about the criterion p-value is it’s your willingness to be wrong. We decide to redesign the product because it’s very improbable the failure rate is at or below our “Who cares?” rate. Very improbable, but not impossible. In fact, with a criterion of 0.028, if the true failure rate really is 10%, then in 28 out of 1000 usability tests you will be doing a redesign that is definitely not worth it. Statisticians call such actions a Type I error, a case of disbelieving the Hypothetical Population State A when it is in fact true. The probability of a Type I error is equal to your criterion p-value, the point at which you say, “I don’t buy that hypothetical state. Let’s act like it’s not true.”
By the way, the probability of a Type I error is symbolized by the Greek letter alpha. You know, that thing Wayne from Finance can’t distinguish from a hole in the ground.
Determining Statistical Significance
Right, you’ve gotten a p-value of 0.028, and thus have very high doubt the population failure rate is 10% or less; rather you have very high confidence the population failure rate is over 10%. But is it statistically significant? That’s where the social convention part comes in.
As a UXer, you and your stakeholders can in principle use any criterion p-value you want depending on your tolerance of risk in the particular situation. Scientists, however, needed to standardize their statistical procedures, which included selecting a value to divide doubtful and plausible. In what may be regarded as a remarkable example of international cooperation, they decided as a community that the criterion of doubt is 0.05. It’s a social convention, but not an arbitrary one, taking into account the practical limits of theoretical research and what scientists are trying to accomplish. For scientists, the decision they make based on this criterion is whether they have new scientific knowledge or not. To do inferential statistics, scientists select a Hypothetical State A of the population that essentially represents the status quo of knowledge. If scientists observe something with a p-value of less than 0.05 given that status quo, then they conclude that the status quo is wrong, and they’ve discovered something new.
A 0.05 criterion is pretty strict. It basically says, “you really can’t dismiss this new information as due to sampling error and be taken seriously.” It means that with 95% certainty, something other than sampling error must be responsible for the observed result. Frankly, there aren’t a lot of things outside of mathematics someone can say they know with 95% certainty. I mean, people say “I’m 95% sure of such-and-such” all the time, but if you were to actually count how often they were really right when they say that, it’ll be something like 70%. I’m 95% sure of that.
The 0.05 value is the conventional scientific criterion for statistical significance. If a p-value is at or less than 0.05, then the sample result is “statistically significant.” It means one concludes that a particular hypothetical state of the population is false. On the other hand, if the p-value is above 0.05, it’s “not statistically significant.” It means one cannot conclude a particular hypothetical state of the population is false.
So, if we were to follow scientific convention, we’d recognize that (given a hypothetical failure rate of 10%) getting 2 out of 3 failures has a p-value less than 0.05.
Holy crap! We’re statistically significant with a sample size of only 3!
Bigger Effects Mean Smaller P-values
Now before you go telling all your friends that all you need is a sample size of three for statistical significance, let’s think about what we were calculating. The p-value is specifically the probability of getting two out of three failures (or more) given a hypothetical failure rate of 10%. If only one user failed, we’d be asking for the probability of getting one out of three failures (or more) given a hypothetical failure rate of 10%, which is a different number. In fact the p-value in that case is 0.271. So seeing two out of three failures is statistically significant, but seeing one out of three is not. With a p-value of 0.271 there’s a better than one-in-four chance of seeing one out of three failures (or more) when the population rate is 10%; improbable, but reasonably plausible. You’ve a pretty big risk of a Type I error if you invest in changing a design when just one out of three users failed. There’s a pretty good chance it’ll be a waste of money.
For that matter, maybe you have even stricter standards for plausibility than the scientific community. After all, 0.05 is far from impossible. Well, for you, it might interest you to know that the probability of getting three out of three failures with a 10% population rate is 0.001 or 1 in 1000. Still possible, but really really unlikely. If you’re looking to be extra extra sure you’re not committing a Type I error in your redesigns, maybe you should only do them if three out of three users fail in the usability test. A p-value of 0.001 is fairly called highly statistically significant. Yeah, Wayne, with a sample of just three.
So here’re all the p-values for a sample size of three and a hypothetical population state of 10% or less:
|Observed Failures|Percent of Sample|p-value|
|---|---|---|
|0 or more|0%|1.000|
|1 or more|33%|0.271|
|2 or more|67%|0.028|
|3 or more|100%|0.001|
I’ve included the percents of the sample to show that the more the sample deviates above the hypothetical population value (10%, in this case), the smaller the p-value.
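If you’d rather generate a table like this than take my word for it, here is a rough Python sketch. It assumes independent users and a binomial model, and the helper name `p_value_at_least` is just something I made up for the illustration.

```python
from math import comb

def p_value_at_least(failures, sample_size, hypothetical_rate):
    # Binomial probability of `failures` or more out of `sample_size`.
    return sum(
        comb(sample_size, k)
        * hypothetical_rate ** k
        * (1 - hypothetical_rate) ** (sample_size - k)
        for k in range(failures, sample_size + 1)
    )

SAMPLE_SIZE = 3
HYPOTHETICAL_RATE = 0.10  # Hypothetical State A from the example

# One (observed failures, p-value) pair per possible outcome.
table = [
    (k, p_value_at_least(k, SAMPLE_SIZE, HYPOTHETICAL_RATE))
    for k in range(SAMPLE_SIZE + 1)
]

for k, p in table:
    print(f"{k} or more | {k / SAMPLE_SIZE:.0%} | {p:.3f}")
```

Change `SAMPLE_SIZE` or `HYPOTHETICAL_RATE` to get the same kind of table for your own test.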
Minding Your P’s
These are the take-home lessons:
- Wayne is a jerk.
- You can’t just look at the size of a sample and decide if a result is statistically significant or not. The calculation of the p-value depends on the difference between the observed result and the selected hypothetical state of the population. The bigger the difference, the more likely it’s statistically significant. That’s why I said Wayne was probably blowing smoke when he said that results with a sample size of five were not statistically significant. He couldn’t possibly know if it were significant or not without knowing the selected hypothetical state of the population and the observed result. A result can be significant with a sample of three. Likewise, a result can sometimes not be significant with a sample of 300,000.
- Statistical significance is not really the important thing. Statistical significance includes the social convention of the 0.05 breakpoint, which may or may not be suitable for your situation. Speaking as a scientist myself, the 0.05 breakpoint is not a bad one. It has served the scientific community well, and if you have nothing else to go on, it’s a good breakpoint to use in your own work. However, as a UI designer, your goal is to produce the best design you can, which is a different goal from a scientist trying to establish new knowledge. That may imply a different breakpoint than 0.05.
This is what I meant when I said you should ignore statistical significance. The main thing that statistical significance has to offer usability testing isn’t the ultimate “significant-or-not” label to slap on results, but rather the process of calculating the probability of a result given a hypothetical population state. Instead of attending to yes-no significance, you should attend to the p-values. They represent a purely mathematical concept that can be applied to any situation. Any time you report the result of a sample, whether it be big or small, you should also explicitly include the p-value. Then, you, your project designers, your stakeholders, or whoever is looking at your result can decide for themselves what they should do. Over at ABtests.com, I’m frankly getting tired of having to calculate the p-values myself. It really needs to be on the page right next to the “Higher Conversion!” stamp.
You should also include your hypothetical state of the population, unless it’s implied. This isn’t necessary in the case of an A-B test, because it’s understood that the hypothetical population state is that A and B have the same conversion rate.
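To make the earlier point about sample size concrete, here is a minimal sketch of a one-tailed binomial p-value, the kind of calculation used throughout this post. The function name, the 10% hypothetical state, and the failure counts are all assumptions chosen for illustration; they don't come from any particular library or from real test data.

```python
from math import exp, lgamma, log

def p_value_at_least(failures, n, rate):
    """One-tailed binomial p-value: P(X >= failures) among n users
    when the hypothetical population failure rate is `rate` (0 < rate < 1)."""
    def log_pmf(k):
        # log of C(n, k) * rate^k * (1 - rate)^(n - k); working in logs
        # keeps the arithmetic stable for very large n
        log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return log_comb + k * log(rate) + (n - k) * log(1 - rate)
    return sum(exp(log_pmf(k)) for k in range(failures, n + 1))

# Hypothetical state (assumed for illustration): 10% of the population fails.
tiny = p_value_at_least(3, 3, 0.10)             # all 3 of 3 users fail
huge = p_value_at_least(30_050, 300_000, 0.10)  # about 10.02% of 300,000 fail

print(f"n = 3:       p = {tiny:.3f}")
print(f"n = 300,000: p = {huge:.3f}")
```

With these made-up numbers, the sample of three comes in far below 0.05, while the sample of 300,000 comes in well above it, because its observed rate is barely different from the hypothetical state.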
To help you in using inferential statistics in your own usability tests, below are the p-values for a sample size of three for various hypothetical states.
I’ve also prepared tables for the actual numeric values. The values are the p-values for getting X number of “failures” or more, where a “failure” can be whatever you define it to be, as long as it’s something you want less of. For example, maybe it’s when a user utterly bombs the task, or maybe it’s when they just make a single recoverable mistake on a particular part of the UI. Maybe it’s just when they take too long to finish the task. All that matters is that you’re consistent with all the users in your sample. The graphs and tables work for any case when you can divide user events into two mutually exclusive and exhaustive categories. To use them:
- Define failure.
- Select your Hypothetical State A.
- Count how many users fail.
- Look up that count for your sample size and your chosen Hypothetical State, and read the p-value.
- If the p-value is low enough for you, consider the actual rate to be greater than the Hypothetical State A.
If you’ve got “successes,” (something you want more of in your design) then convert them to failures, where the number of failures is how many users didn’t succeed. Also convert your Hypothetical State A from a success rate to a failure rate by subtracting success rate from 100%. Sorry, I can’t do all of the math for you.
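If you'd rather compute a value than look it up, the steps above can be sketched in a few lines. This is only an illustration of the lookup, assuming the one-tailed binomial calculation this post uses; the function names are my own inventions for the example.

```python
from math import comb

def p_value_at_least(failures, n, failure_rate):
    """P(X >= failures) among n users at the hypothetical failure rate."""
    return sum(comb(n, k) * failure_rate**k * (1 - failure_rate)**(n - k)
               for k in range(failures, n + 1))

def p_value_from_successes(successes, n, success_rate):
    """Convert successes to failures and the success rate to a failure rate,
    then do the same lookup."""
    return p_value_at_least(n - successes, n, 1 - success_rate)

# Two or more failures among 3 users, Hypothetical State A = 10% failure rate.
print(round(p_value_at_least(2, 3, 0.10), 3))        # 0.028

# The same result in success terms: 1 of 3 succeeded, hypothetical success rate 90%.
print(round(p_value_from_successes(1, 3, 0.90), 3))  # 0.028
```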
What About Hypothetical State B?
Well, what are the odds that statisticians would call something a “Type I error” when there isn’t also a “Type II error”?
To review, we’re getting the probability of seeing the result we saw in the sample given the “Who Cares?” Hypothetical State A. This p-value is the probability of a Type I error: redesigning the product when it definitely isn’t worth it, in this case. Say you test three users and redesign your product only if two or more fail. Then, with Hypothetical State A at 10%, from the table (or graph) we see there’s a 0.028 chance you’ll do a redesign when you definitely shouldn’t. Sounds good. You’re keeping with the scientific tradition of holding your Type I error rate to 0.05 or less. But there are other considerations than keeping design costs down. We can also ask, what’s the cost if you don’t redesign a page? What about not redesigning when it would be definitely worth it if you did? That’s what Type II error is: when you should have considered Hypothetical State A to be wrong, but didn’t.
Here’s where Hypothetical State B comes in. It represents the hypothetical rate in the population where it’s most definitely worth doing a redesign. Let’s say you select 50% or greater for your Hypothetical State B. We can use the tables to see the probability of redesigning the product when you definitely should. That probability is called “statistical power.” Subtract it from one, and you have your Type II error rate, the probability of not redesigning when you should.
Let’s see: two or more failures out of three, with a hypothetical rate of 50%, gives you 0.500.
By only redesigning if two or more fail, you’re keeping your Type I error rate to 0.028, but your Type II error rate is 1 - 0.500 = 0.500. You’re going to miss half of the design flaws that are most definitely worth fixing. Half of the problems that affect 50% of your users or more are going to make it to the final product. There’ll be frequent interaction errors! Slow task completion times! Lost revenue! User riots! The collapse of civilization!
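Here is a small sketch of that arithmetic for the rule "redesign if two or more of three users fail," assuming State A is 10% and State B is 50% as in this example. The variable and function names are made up for illustration.

```python
from math import comb

def prob_at_least(k, n, rate):
    """P(X >= k) failures among n users at a given population failure rate."""
    return sum(comb(n, i) * rate**i * (1 - rate)**(n - i) for i in range(k, n + 1))

n, critical = 3, 2             # redesign only if 2 or more of 3 users fail
state_a, state_b = 0.10, 0.50  # "not worth it" and "definitely worth it" rates

type_1 = prob_at_least(critical, n, state_a)  # redesigning when you shouldn't
power = prob_at_least(critical, n, state_b)   # redesigning when you should
type_2 = 1 - power                            # not redesigning when you should

print(f"Type I error rate:  {type_1:.3f}")  # 0.028
print(f"Power:              {power:.3f}")   # 0.500
print(f"Type II error rate: {type_2:.3f}")  # 0.500
```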
In usability testing, a Type II error is just as bad as a Type I error. You minimize the total chance of error by balancing Type I and Type II error probabilities. That’s how we set our hypothetical states, with State A being Definitely Not Worth Redesigning and State B being Definitely Worth Redesigning, and the break-even Whatever Zone in between. Deviations from that zone in either direction are equally bad.
This is in stark contrast to scientific research where Type I errors are worse than Type II errors. A Type I error means you’re claiming to discover something that isn’t true, which would undermine the credibility of science and result in a lot of people doing unnecessary (and possibly dangerous) things based on erroneous information. It increases the chance of two contradictory facts being “proven” statistically, which would cause endless confusion. A Type II error in science means a discovery will have to wait for another day. It serves science to be conservative, to adopt a skeptical attitude to new ideas until they are empirically supported with high confidence.
With this in mind, let’s return to our table for a sample of three. What sort of result would best balance Type I and Type II errors? Let’s look at one failure out of three. Given a Hypothetical Population State A of 10%, the Type I error rate is 0.271. Given a Hypothetical Population State B of 50%, statistical power is 0.875, for a Type II error rate of 0.125. Those aren’t great odds, but they’re a better balance than 0.028 and 0.500.
Given your choices in this particular example, it looks to me like you should fix every problem you detect in every user. That looks like the best balance of power and Type I error rates for a sample size of three with the selected hypothetical states. Inferential statistics for usability tests boils down to this selection of such a minimum critical result (1 in this case) for deciding whether to redesign the product or not. When reporting your decisions, you should report the critical result and its associated Type I error rate and power so your audience understands the chance you might be wrong in your decisions. And statistical significance? Screw statistical significance. If you give your Type I error rate and power, then you have no need for potentially misleading yes-or-no statements of statistical significance. Even the sample size is irrelevant.
Notice there is a trade-off between Type I error rate and power. Redesigning when two out of three fail means low Type I error rate (0.028), but little power (0.500). Redesigning when one out of three fail means good power (0.875), but you had to compromise Type I error rates to get there (0.271). That seems fair. You can’t have everything.
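The trade-off is easy to see if you list every possible critical result side by side. This is a sketch under the same assumptions as the example (three users, State A of 10%, State B of 50%); `error_table` is a hypothetical helper name, not something from a library.

```python
from math import comb

def prob_at_least(k, n, rate):
    """P(X >= k) failures among n users at a given population failure rate."""
    return sum(comb(n, i) * rate**i * (1 - rate)**(n - i) for i in range(k, n + 1))

def error_table(n, state_a, state_b):
    """For each critical result, return (critical, Type I error rate, power)."""
    return [(c, prob_at_least(c, n, state_a), prob_at_least(c, n, state_b))
            for c in range(1, n + 1)]

for critical, type_1, power in error_table(3, 0.10, 0.50):
    print(f"redesign if {critical}+ of 3 fail: "
          f"Type I = {type_1:.3f}, power = {power:.3f}")
# redesign if 1+ of 3 fail: Type I = 0.271, power = 0.875
# redesign if 2+ of 3 fail: Type I = 0.028, power = 0.500
# redesign if 3+ of 3 fail: Type I = 0.001, power = 0.125
```

Raising the critical result pushes both columns down together, so with a fixed sample size you can only trade one kind of error for the other.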
What if you don’t want to compromise? Maybe you can’t accept choosing between a Type I error of 0.271 or power of 0.500. Let’s say you just don’t like those odds. It means too many mistakes, too often redesigning things that don’t need redesigning, or too many things needing redesigning not getting redesigned. Maybe the Type I or Type II errors are too costly to accept those kinds of rates. There is something you can do about it: increase your sample size. Take a look at a sample size of five for example (also on table page):
Specifically, look at getting two out of five: Type I for 10% is 0.081. Not bad. Not too far from the statistically significant standard for scientists. Power for 50% is 0.813. Also not bad. Frankly most scientists would be quite happy with that kind of power. The fact that larger sample sizes reduce error is probably consistent with your intuition. Bigger samples mean more confidence in the results, less chance you’ll think the wrong thing one way or the other. However, it’s not like there’s some magic threshold sample size that ensures your results are reliable. It’s all a matter of degree. Bigger samples are like a bigger microscope, allowing you to more reliably detect smaller differences in things. Bigger samples mean more costly usability tests, so they only make sense if you need to save on the costs of Type I and Type II errors. There’s statistical justice in the world. You want to make fewer errors? Then you better be ready to expend greater effort for it. You get out what you put in.
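To compare the two sample sizes directly, here is a sketch that computes the Type I error rate and power for the decision rules discussed so far, assuming the same hypothetical states of 10% and 50%. The helper names are illustrative only. I print four decimal places so you can see the values before they are rounded to the 0.081 and 0.813 quoted above.

```python
from math import comb

def prob_at_least(k, n, rate):
    """P(X >= k) failures among n users at a given population failure rate."""
    return sum(comb(n, i) * rate**i * (1 - rate)**(n - i) for i in range(k, n + 1))

STATE_A, STATE_B = 0.10, 0.50  # hypothetical states used in this post's example

def rule_summary(n, critical):
    """Return (Type I error rate, power) for 'redesign if critical+ of n fail'."""
    return prob_at_least(critical, n, STATE_A), prob_at_least(critical, n, STATE_B)

for n, critical in [(3, 1), (3, 2), (5, 2)]:
    type_1, power = rule_summary(n, critical)
    print(f"{critical}+ of {n}: Type I = {type_1:.4f}, power = {power:.4f}")
# 1+ of 3: Type I = 0.2710, power = 0.8750
# 2+ of 3: Type I = 0.0280, power = 0.5000
# 2+ of 5: Type I = 0.0815, power = 0.8125
```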
(For more on how to select a sample size for a usability test, see Lewis, J. R. (2006). Sample sizes for usability tests: Mostly math, not magic. Interactions, 13(6), pp. 29-33. Note, however, that Lewis is only concerned with achieving sufficient statistical power, and doesn’t really address Type I error rates. On the other hand, in many projects, such as when designing for a large population of users, that’s appropriate because then it’s worth fixing a design flaw if even a fraction of a percent of your users have a problem; Type I errors are no longer a realistic concern and all you need to specify is Hypothetical State B.)
If you have ever been puzzled about why usability testing works with sample sizes of only three to five, now you see that from a statistical standpoint, it really isn’t all that bad. You actually have a pretty good chance of detecting serious design flaws that would affect most of your users while not wasting time on design flaws that affect a small fraction of your users. Far from being a threat to usability practices, inferential statistics validates them. From a strictly rational mathematical perspective of minimizing costs due to (1) doing unnecessary redesigns, (2) letting design flaws make it through to production, and (3) conducting usability tests, small sample size usability tests are often the optimal solution.
The primary limitation of a small sample size is that it leaves a lot of uncertainty about how big a given problem is once you decide you have one. For example, you may know that it very likely affects at least 10% of your users, but it’s reasonably plausible the actual value is anywhere between 10% and 80%. For most usability situations such imprecision is acceptable. If you’re going to fix all detected problems anyway, who cares which ones are the worst?
If you still find yourself distrustful of small sample sizes, forget the math and look at it this way: if a problem exhibits itself in a large portion of your user population, you’re very likely to see it in a small sample; if the problem exhibits itself in a very small portion, you’re very unlikely to see it in a small sample. Small sample usability tests, in other words, are a wide-mesh net that is likely to catch only the big problems, which are often the ones the client primarily cares about.
There’s also statistical justice in the errors themselves, in the sense that if you do make an error, it probably won’t be a severe error. In this example with 0.813 power, you have a 0.187 chance of missing a design flaw that affects 50% or more of your users. Could it affect a lot more than 50%? Sure, it’s possible, but not likely. Look at the table for the chance of not deciding to fix a flaw that affects 80% of your users: 0.007. Most likely if you make a Type II error, it’ll be for a problem near the 50% mark. Likewise for Type I errors. If you end up fixing a flaw that isn’t worth fixing, it’s more likely to help 9% of your users than 1%, which may not be cost effective, but isn’t a total waste of design resources, especially if I personally happen to be in that 9%. The punishment is proportional to the crime.
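You can check that proportionality yourself by sweeping the true population rate while holding the decision rule fixed. This is a sketch assuming the rule from the example (redesign if two or more of five users fail); the particular rates in the loops are my own picks for illustration.

```python
from math import comb

def prob_at_least(k, n, rate):
    """P(X >= k) failures among n users at a given population failure rate."""
    return sum(comb(n, i) * rate**i * (1 - rate)**(n - i) for i in range(k, n + 1))

N, CRITICAL = 5, 2  # the rule from the example: redesign if 2+ of 5 users fail

def chance_of_missing(true_rate):
    """Probability of NOT redesigning when the flaw affects true_rate of users."""
    return 1 - prob_at_least(CRITICAL, N, true_rate)

def chance_of_redesign(true_rate):
    """Probability of redesigning when the flaw affects true_rate of users."""
    return prob_at_least(CRITICAL, N, true_rate)

for rate in (0.50, 0.60, 0.70, 0.80):
    print(f"flaw affects {rate:.0%}: chance of missing it = "
          f"{chance_of_missing(rate):.4f}")

for rate in (0.01, 0.05, 0.09):
    print(f"flaw affects {rate:.0%}: chance of fixing it anyway = "
          f"{chance_of_redesign(rate):.4f}")
```

The chance of missing a flaw shrinks quickly as the flaw gets more common, and the chance of an unnecessary fix shrinks as the flaw gets rarer.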
Problem: Using inferential statistics to make rational redesign decisions based on small sample size usability tests.
- With your stakeholders, choose Hypothetical States A and B.
- Hypothetical Population State A: your agreed-on population rate of failure that is definitely not worth redesigning for.
- Hypothetical Population State B: your agreed-on population rate of failure that is definitely worth redesigning for.
- Using the tables of this post, look up the p-values for States A and B for your sample size.
- Select a critical result that balances these p-values, giving you a low probability for State A and a high probability for State B.
- The p-value for Hypothetical State A is your Type I error probability: the chance you’ll redesign something when it is definitely not worth it.
- The p-value for Hypothetical State B is your statistical power: the chance you’ll redesign something when it is definitely worth it.
- If there is no possible result associated with acceptable combinations of p-values, then increase your sample size.
- If you observe users failing at or above your chosen critical result, then you should redesign your product.
- Rely on qualitative analysis to decide how to redesign the product.
That’s it. Make your choices and place your bets.
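For those who would rather let the computer place the bet, the checklist can be sketched as a short routine. One loud assumption: I interpret "balance" as minimizing the sum of the Type I and Type II error rates, which treats the two errors as equally costly. That is one reasonable reading of this post, not the only one, and the function names are hypothetical.

```python
from math import comb

def prob_at_least(k, n, rate):
    """P(X >= k) failures among n users at a given population failure rate."""
    return sum(comb(n, i) * rate**i * (1 - rate)**(n - i) for i in range(k, n + 1))

def choose_critical_result(n, state_a, state_b):
    """Pick the critical result with the smallest Type I + Type II error.
    ASSUMPTION: both kinds of error are weighted equally."""
    best = None
    for critical in range(1, n + 1):
        type_1 = prob_at_least(critical, n, state_a)
        power = prob_at_least(critical, n, state_b)
        total_error = type_1 + (1 - power)
        if best is None or total_error < best[3]:
            best = (critical, type_1, power, total_error)
    return best[:3]

def should_redesign(failures, critical):
    """Redesign if the observed failures reach the chosen critical result."""
    return failures >= critical

for n in (3, 5):
    critical, type_1, power = choose_critical_result(n, 0.10, 0.50)
    print(f"n = {n}: redesign if {critical}+ fail "
          f"(Type I = {type_1:.4f}, power = {power:.4f})")
```

With States A and B at 10% and 50%, this picks one failure for a sample of three and two failures for a sample of five, matching the choices worked through above. If your stakeholders consider one kind of error more costly than the other, you would weight the sum accordingly.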
Stat Geek Corner
All probabilities in this post and in the tables are one-tailed binomial calculations. I use one-tailed calculations because in a usability test (unlike theoretical scientific research), there is generally no interest in cases where the population failure rate is less than the null hypothesis value; all you care about is if you have to redesign the product or not. Besides, with sample sizes like these, you need all the power you can get.
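For reference, the one-tailed binomial p-value can be written out as follows. The symbols are my own notation rather than anything defined in the tables: n is the sample size, k is the observed number of failures, and \pi is the hypothetical population failure rate.

```latex
p \;=\; P(X \ge k \mid n, \pi) \;=\; \sum_{i=k}^{n} \binom{n}{i}\, \pi^{i} (1 - \pi)^{n-i}
```

As a check against the numbers above, n = 3, k = 2, and \pi = 0.10 gives 3(0.10)^2(0.90) + (0.10)^3 = 0.027 + 0.001 = 0.028.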
Binomial statistical tests make no assumptions regarding normality of the sampling distribution or other things parametric tests do, making them ideal for the small sample sizes used in usability tests. Their main disadvantage is that they’re limited to binary events (e.g., failure or success). If you’re dealing with averages or more than two categories of performance, other statistical tests are more appropriate, but more complicated to use correctly. I could write another post about those tests for those of you who’ve at least had an undergraduate statistics course sometime in your past. Let me know if you want me to.