A series on statistics applied to usability engineering
Now you’ve done it. You should’ve kept your mouth shut. There you were, comparing user ratings between a couple of web home pages, and, to your disappointment, the amateurish proto-CSS design had a better average survey rating than the fully researched state-of-the-art replacement you’ve been promoting. But you’re not about to give up. After all, there were only eight users in the test. Maybe that wasn’t enough. So you put on your most authoritative voice and try, “Of course, we should do an inferential statistical test to see if this is just sampling error.” Whoa, dude, you know statistics? “Well, yeah, I took an intro course in college years ago, and so I know we should-” Cool, brudda. Go run the stats and tell us what it shows.
Uh oh. You remember what statistics is about, but do you remember how to do it? Lucky for you, you ended up on this web site. Here I’ll explain how to perform statistical analyses to answer questions like whether your sample size is large enough to draw valid conclusions. I’ll cover analysis of common measurement types used in usability studies and common usability test designs. That should be doable in a single blog post, no?
No. Sometime last month while writing this, it finally dawned on me that this was too much for one post. So instead I present to you a series of posts on statistics and usability, which, unlike previous series such as Learning from Lusers, I’ll post one after the other each month, like a DJ playing a block of classic Led Zeppelin.
We start with:
Stat 101: Sample size, inferences about the user population, Type I and II errors, statistical significance, and statistical power.
Yes, Stat 101 from May 2010, which retroactively becomes the first post in the series. Stat 101 covers the basic principles of statistics for those with no knowledge of stats. It includes tables you can apply to the task failure rates observed in your usability tests (a quick sketch of that sort of calculation follows the list below). It concludes that:
- You can do statistical analysis of small sample size usability tests.
- You should do statistical analysis of small sample size usability tests.
- Doing so will often show that you have perfectly valid data from your small sample size usability tests.
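To make that concrete, here’s a minimal sketch of the kind of failure-rate arithmetic behind those tables. The scenario and numbers are hypothetical (as is the helper name p_at_least), and this is just one way to compute an exact binomial probability, not code from Stat 101 itself:

```python
from math import comb

def p_at_least(n: int, x: int, p: float) -> float:
    """Chance of x or more task failures among n users, assuming
    the true population failure rate is p (exact binomial tail)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# Hypothetical example: 2 of 8 users failed the task. If the true
# failure rate were an acceptable 10%, how often would sampling
# error alone produce a result this bad or worse?
print(round(p_at_least(8, 2, 0.10), 3))  # 0.187 -- quite plausible
```

A probability that large means two failures out of eight is still consistent with an acceptable failure rate, which is exactly the kind of small-sample inference Stat 101 walks through.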
If you haven’t read Stat 101, go ahead and read it, even if you’re pretty good with statistics. Stat 101 applies your typical academic statistics to the problem of user interface design, so it includes some unconventional points, procedures, and perspectives that aren’t covered in a typical college course but that we’ll use in the later posts. And tell Wayne in Finance I said “hi.”
After Stat 101, we’ll go to the 200-series.
Stat 201: Comparison of averages, analysis of user ratings, improving on an existing UI design, estimating needed sample sizes, the normality assumption, between-subject testing, and one-tailed t-tests.
Stat 202: More comparisons of averages, analysis of completion times, choosing between two new UI designs, maximizing statistical power or “decision potential,” within-subject testing, two-tailed t-tests, and data transformations.
Stat 203: Analysis of frequency data, such as task completion rates and conversions, binomial tests, chi-square tests, and more.
And we’ll work with Surfer Stan.
The 200 series applies the concepts I covered in Stat 101, but moves beyond the simple case of how to interpret x number of users failing at task y in a usability test. Now we’re going to start looking at “relationship inferences,” which will help us decide which design alternative is better. At the same time, while Stat 101 is a prerequisite, I’m not picking up where Stat 101 left off (that’s why it doesn’t start with Stat 102). We’re going to get into more sophisticated analyses, and I’m afraid I can’t compress an entire stat textbook into a few posts. The 200 series assumes you’ve had some stat, and the formulas I present are review - maybe a review from back when there was such a thing as a “record store,” but a review anyway. If you are a usability engineer and you’ve never had intro stats, what are you waiting for? Go take it now. Or at least read a book on it. Nothing like bending your brain around sampling distributions to while away a Sunday afternoon.
How sophisticated will the analyses be? Well, you’re going to have to do some actual calculations this time, but nothing you can’t do with the built-in formulas of a spreadsheet. No need to rush out and plunk down the hefty licensing fee for SAS. We aren’t going to do any multivariate principal component maximum likelihood structural hazard modeling whatevers. Instead, I’m talking about the t-test, the chi-square test, and binomial tests (the last actually introduced in Stat 101). The series is geared to people who occasionally need to compare human performance on a couple of alternative designs. In particular, the 200 series is for UX practitioners who want to know which of two designs is better for the users, which is the most common statistical situation in usability engineering. I’m also assuming you merely want to make a reasonably well-informed design decision, something better than, “oh, I guess that sample size is big enough,” and you don’t need high-precision academic-grade absolutely-the-best statistical tests. If you want the latter, then hire or contract a professional statistician with his or her own SAS license. All statistics is approximation. It’s just that some approximations are better than others. I’m aiming to go over some stats that are good enough for most needs in usability engineering.
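To preview where the 200 series is headed, here’s a rough sketch of the between-subjects comparison Stat 201 will cover, using nothing a spreadsheet couldn’t match. The ratings are invented for illustration, the function name t_between_subjects is mine, and the usual assumptions (roughly normal ratings, a one-tailed hypothesis chosen in advance) are glossed over here:

```python
from math import sqrt
from statistics import mean, stdev

def t_between_subjects(a: list[float], b: list[float]) -> tuple[float, int]:
    """Independent-samples t statistic with pooled variance, plus its
    degrees of freedom: the classic between-subjects comparison."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a)**2 + (nb - 1) * stdev(b)**2) / (na + nb - 2)
    std_err = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / std_err, na + nb - 2

# Hypothetical 1-to-7 satisfaction ratings, four users per design.
old_design = [5, 6, 4, 6]
new_design = [6, 7, 6, 7]
t, df = t_between_subjects(new_design, old_design)
print(f"t = {t:.2f}, df = {df}")  # t = 2.24, df = 6
# Compare t to the one-tailed critical value from a t table (or a
# spreadsheet's T.DIST.RT) to see if it beats your chosen alpha.
```

That’s the level of math we’ll be dealing with: a few means, variances, and square roots, plus a table lookup. More in Stat 201.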