### Comparing averages, Part 1

The second in a series on inferential statistics, covering:

- Improving on an existing UI design.
- Estimating needed sample sizes.
- The normality assumption.
- Between-subject testing.
- One-tailed t-tests.

**Prerequisites:** Stat 101; a college-level intro stat course (or equivalent) sometime in your life; surfer lingo.

Here’s the scenario: Surfer Stan’s ecommerce site hasn’t been updated since single-fin boards made a comeback. Stan contacted you because he heard you had gnar-gnar technology to make one sick user experience. That means really good, he assures you. But to get the full contract, you had to prove it. You agreed to build an interactive prototype of the home page and test it head-to-head against the existing site. You were looking to improve both the aesthetics and the clarity of the design. You thoroughly studied your users and the surfing domain, and got a good handle on not only how your users see the task, but also the values and culture of surfers. Users seemed really thrilled by the early prototypes. They were particularly impressed with how the pulldown menus curl out like breaking waves. You were expecting users to rate it much better than the inconsistent and unharmonious old version. However, on the big day when the fully functioning new version took on the old, here’s what you found:

**Data**

| User (Old Home Page) | Rating | User (New Home Page) | Rating |
|---|---|---|---|
| 1 | 23 | 2 | 26 |
| 3 | 20 | 4 | 29 |
| 5 | 31 | 6 | 21 |
| 7 | 16 | 8 | 13 |

**Statistics**

| | Old Home Page | New Home Page | Both Pages |
|---|---|---|---|
| Sample | 4 | 4 | 8 |
| Average | 22.50 | 22.25 | 22.38 |
| Std Dev | 6.35 | 6.99 | 6.19 |

Difference in averages (new – old): **-0.25**

Look at the statistics. You remember standard deviation (Std Dev), right? It’s roughly how far on average the numbers are from their average. In this case, most of the eight ratings are within about 6 points of the average 22.38. We use the “unbiased estimate of the population standard deviation,” the one with the weird (n – 1) in the denominator of its formula. It’s the STDEV() function in Excel. Also check out the distribution of the data, graphing it as a histogram or number line if you like. It’s got a reasonable bell-shape to it, with most scores clustered in the middle around 22, and few at the extremes. Could be normally distributed, which would be nice, but, as we’ll see, it doesn’t have to be for statistical analysis.
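If you want to verify these statistics outside of Excel, here’s a quick sketch in Python; the `statistics` module’s `stdev()`, like Excel’s STDEV(), uses the (n – 1) denominator:

```python
import statistics

old_ratings = [23, 20, 31, 16]  # users 1, 3, 5, 7
new_ratings = [26, 29, 21, 13]  # users 2, 4, 6, 8

old_avg = statistics.mean(old_ratings)   # 22.50
new_avg = statistics.mean(new_ratings)   # 22.25
old_sd = statistics.stdev(old_ratings)   # ~6.35 (n - 1 denominator)
new_sd = statistics.stdev(new_ratings)   # ~6.99
diff = new_avg - old_avg                 # -0.25

print(old_avg, new_avg, round(old_sd, 2), round(new_sd, 2), diff)
```
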

Something else should jump out at you: the averages. The new design scored *lower* than the old design. The difference in the averages (new average minus old average) is -0.25. That’s a negative statistic in more than one sense. What happened? Nothing wrong with your methodology. You tested a total of eight users, assigning four at random to try each version. That’s not an atypical number for a usability test. But can you trust the results? With only four users assigned at random to each version, maybe you happened to put one or two negative nutters on the new version. I mean, check out the dawg who gave the new design a 13. He’s way out there. Is there a reasonable chance you just had bad luck? Put it this way: what is the probability of seeing a -0.25 point difference by chance when eight people randomly use one or the other version?

### Set Your Hypothetical States

Here’s where we return to what we learned in Stat 101. When Stan asks you, “what’s the probability, bubba?” you ask back, “the probability *given what*, duder?” You and Stan need to decide on Hypothetical States A and B.

Hypothetical State A: What true improvement in ratings is definitely *not* worth the new design?

Hypothetical State B: What true improvement in ratings *is* definitely worth the new design?

In some cases, any improvement at all is worth at least considering. That is, Hypothetical State A should be any difference in the ratings greater than zero (using the arbitrary convention of subtracting the old rating from the new rating). That’s probably what you were taught to do in stat class, and it makes sense for major sites with a lot of users where even a minuscule improvement translates into big total gains that will pay for the cost of redesign in no time. It can also make sense for niche sites with few users if the site is going to be redesigned anyway (or created in the first place), and the client wants some indication that *your* redesign in particular is going to help. However, Stan has a very small operation and he already has a functioning web site that’s doing okay. To justify the expense of building a new one, you have to do better than just *any* improvement.

There’re several ways to arrive at your hypothetical states.

**Equivalent Cost**. If you can somehow estimate the increase in revenue for each incremental improvement to the site, then you can figure how much improvement you need to pay for the site redesign in a reasonable amount of time. For example, if each point of improvement on the satisfaction scale is correlated with 2 more conversions per month totaling $500 in revenue, then to pay off a $36,000 site redesign in two years, you need at least a 3.0 point improvement. That would definitely make the redesign worth doing (Hypothetical State B). The improvement that’s definitely *not* worth redesigning (Hypothetical State A) would be somewhat less than 3 points, when you consider that immediate conversion in a single session isn’t the only consideration; brand loyalty and word-of-mouth have value too. Generally, you’re not going to be able to translate scale scores into conversions, however; you won’t have the data. Stan certainly doesn’t have that data.

**Percentile Improvement**. If you’re using a standardized scale, such as SUS, or a scale for which you have data on a lot of sites, then you can shoot to improve the percentile ranking of the site. For example, you and Stan may agree that improving the site 10 percentage points is definitely worth doing while improving it only 2 percentage points is definitely not worth doing. However, in this case, you’re using a scale tailored specifically for surfers (real surfers, not web surfers) to measure specifically the kind of experience that Stan wants to achieve, so you have no data on percentiles.

**Scale Details**. You and Surfer Stan take a third approach, which is to look at the scale items and judge what change would qualify as a sufficient improvement. For Hypothetical State A, you select the “who cares?” level of improvement, while for Hypothetical State B, you select the “effing yeah!” level of improvement. In this case, let’s say each user’s rating is the sum of five 7-point Likert scales for items like “The home page is totally righteous” (1 – strongly disagree to 7 – strongly agree), so the lowest possible score is 5 (all ones) and the highest is 35 (all sevens). From this, Stan reasons that if the new design on average only improves one item by one point (e.g., moves “totally righteous” from 4, neutral, to 5, “somewhat agree”), then it’s definitely not worth it. In other words:

Hypothetical State A: The difference in the averages is 1.00 points.

Alright.

Of course, right now the difference in averages is *negative* 0.25 points. Your new design performed *worse* than the old design, so it’s looking like an uphill battle.

Following the same line of reasoning, you and Stan agree that if on average most items (3 out of 5) improve by one point on the scale, then it’s definitely worth going for the new design. So,

Hypothetical State B: The difference in the averages is 3.00 points.

This makes a “whatever zone” of 1.00 to 3.00 points: a range of population values where it doesn’t make much difference to Stan whether he stays with the old design or proceeds with the new design. For some reason, I think Surfer Stan would appreciate the “whatever zone.”

If your stat class is coming back to you, you probably recognize Hypothetical State A as corresponding to the “Null Hypothesis” (H_{0}), which is correct. You may also recognize Hypothetical State B as the “Alternative Hypothesis,” which is completely wrong. For one thing, A and B are not exhaustive: it’s possible for both to be wrong about the population. Most intro stat classes, and most advanced classes for that matter, don’t cover Hypothetical State B or its equivalent. Setting and testing Hypothetical State B is a procedure I made up to ensure there’s adequate statistical power in the analysis, which you probably didn’t worry too much about in your intro stat class. Such classes are geared towards applications in science where Type I errors are much worse than Type II errors. It’s true that all scientists try to get as much power as they can (by that I mean statistical power, not evil-scientist take-over-the-world power). However, rarely do scientists try to quantify their power. But we’re doing usability testing, where a Type II error is as bad as a Type I, so we’re going to be setting and testing Hypothetical State B in addition to A.

At this stage, it may be a good time for you and Stan to discuss how much risk of error he’s comfortable with. Soon you’ll be looking at p-values and have to decide if they’re sufficiently low to make a go/no-go decision on whether to proceed with the redesign. So, Stan, what chance of being wrong would you tolerate? What chance are you willing to take that you’re redesigning the site when you definitely shouldn’t, or not redesigning the site when you definitely should?

Scientists use a 0.05 probability; that’s their “level of statistical significance.” Stan may be a thrillseeker among major brutal waves, but he’s pretty conservative with his business. Still, a 0.05 probability strikes him as a little strict. He says he’s happy with 0.10, a 10% chance of being wrong. Dude, that’s a 90% chance of being right.

We now proceed to calculate the probability of getting a -0.25 point difference if Hypothetical State A is true. Our logic is this: If our observed difference is implausible assuming Hypothetical State A is true, then we proceed with the redesign. That is, if the probability of getting at least the observed difference is less than 0.10 for the Definitely Don’t Redesign state, then we conclude that the true difference in the population is higher: it’s at least in the Whatever Zone (where redesigning isn’t an appreciably *harmful* choice to Stan’s business), and may be at or above the Definitely Redesign threshold.

### The Normality Assumption

The t-test is the tried-and-true procedure for calculating the probability of an observed difference in averages. It can detect differences in samples when other statistical tests can’t. You might think data is data, but some test procedures are more efficient than others, acting like a more powerful microscope to detect smaller effects. Among test procedures, the t-test ranks near the top.

But the t-test has a catch: it’s only accurate if the sampling distribution is *normal*, if it has that bell-shape that was probably on the cover of your stat textbook. The sampling distribution is not, contrary to its name, the distribution of your sample data. A t-test does not require that your scale scores be normally distributed. That’s good, because a normal distribution isn’t just any bell-shape. It’s a very specific bell-shape. Our data is bell-shaped, but I don’t know if it’s a normal distribution. More to the point, I don’t care, because only the *sampling distribution* has to be normal.

So what’s a sampling distribution? It’s that thing that you never quite got in stat class. One minute you’re doing histograms and other pretty pictures, and stats is so easy, then POW! Along come sampling distributions, and you’re lucky to eke out a C+. So never mind. Move along.

### Sampling Distributions

Okay, I’ll give it a try. A sampling distribution is this: You’ve got a difference of two averages equal to -0.25 points. You want to know the probability of seeing that -0.25 in a sample of eight users. I mean, if you did the usability test again on a different sample of eight users, you’ll almost certainly see a different difference in the averages. Maybe it would be -1.00 points. Maybe 2.50 points *in favor* of your new design. Now imagine you’re the god of statistics. You can run a usability test on *every possible sample* of eight users. Zillions of usability tests. For each one, you get the difference in the averages, zillions of differences. Graph those zillions of differences as a histogram. *That’s* a sampling distribution. A sampling distribution is the distribution of numbers (statistics) calculated from every possible sample. Each entire sample supplies *one* number to the distribution, not all of its data.

Only the gods have ever seen an actual sampling distribution, but fortunately our ancient ancestors, the high priests of statistics, discovered the central limit theorem, which proves mathematically that sampling distributions involving averages tend to be normally distributed. “Tend to be”? I admit you’d expect more certainty from a mathematical proof, but practically you can assume your sampling distribution is close enough to normal if your data distributions are even vaguely bell-shaped and symmetrical, with extreme scores on one side of the average balanced by extreme scores on the other. Rating scales tend to be bell-shaped and symmetrical, so I think we’re cool on the normal sampling distribution requirement.
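You don’t have to be the god of statistics to get a feel for a sampling distribution; you can simulate a modest slice of one. Here’s a Python sketch. The population of ratings here is made up (hypothetical, for illustration only): roughly bell-shaped around 22 on the 5-to-35 scale, with both groups drawn from the *same* population, so the true difference is zero.

```python
import random
import statistics

random.seed(42)  # reproducible runs

# A made-up population of ratings: bell-ish around 22 on the 5-to-35 scale.
def draw_rating():
    r = round(random.gauss(22, 6))
    return max(5, min(35, r))  # clamp to the scale's possible range

# Re-run the "usability test" on many random samples of eight users,
# four per version, both groups from the SAME population (true difference = 0).
diffs = []
for _ in range(20_000):
    old = [draw_rating() for _ in range(4)]
    new = [draw_rating() for _ in range(4)]
    diffs.append(statistics.mean(new) - statistics.mean(old))

# Histogram these and you get an approximation of the sampling distribution:
# bell-shaped, centered on the true difference (0 here), with a spread close
# to SQRT(6^2/4 + 6^2/4), about 4.24.
print(round(statistics.mean(diffs), 2))
print(round(statistics.stdev(diffs), 2))
```

Twenty thousand samples isn’t “every possible sample,” but it’s enough to see the bell shape emerge from nothing but random sampling.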

If you’re still with me, you can see why the sampling distribution is so important to our problem. If we know the exact distribution of the difference in averages for every possible sample, we can tell how likely we’d get our particular difference in averages (-0.25 points). Is it right in the fat part of the sampling distribution where there are a lot of possible samples that produce about that number? Then there’s a high probability we’d see -0.25 points. Or is it off in one of the tails of the sampling distribution where there are relatively few possible samples that could produce it? Then we have a low probability we’d see -0.25 points.

### The Standard Error

The central limit theorem tells you the shape of the sampling distribution, but to get those probabilities, you need to know the size of the sampling distribution. Fortunately the high priests of statistics come to the rescue again. They discovered there is a mathematical relation between the sampling distribution and the statistics in a sample. You can estimate characteristics about the sampling distribution from your sample data. Specifically, the estimated standard deviation of the sampling distribution of the difference between two averages from two separate groups of users is:

*se* = SQRT(*s1*^2/*n1* + *s2*^2/*n2*)

*s1* and *s2* are the standard deviations from each of your groups of users, while *n1* and *n2* are the sample sizes for each group of users.

The estimated standard deviation of a sampling distribution is called the “estimated standard error,” but I’ll just call it the “standard error” for short and symbolize it with *se*. They call it the standard error because it’s the amount of sampling error you could easily expect to get from your sample.

So the estimated standard error for your rating scale data is:

*se* = SQRT(6.35^2 / 4 + 6.99^2 / 4) = 4.72

You, mere mortal, now know that if you could run every possible sample of eight users through your usability test, the standard deviation of those zillions of differences of the averages will be about 4.72. You’ve observed a -0.25 point difference between the averages, but you now know that -0.25 points could easily be off by 4.72 points from the real difference in the population. The real difference could be more negative… or it could be positive, where the new page is better than the old. That’s notorious, bro! You *are* like a god!
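The same calculation in Python, if you want to check the spreadsheet (`statistics.stdev` uses the same n – 1 denominator as Excel’s STDEV()):

```python
import math
import statistics

old_ratings = [23, 20, 31, 16]
new_ratings = [26, 29, 21, 13]

s1 = statistics.stdev(old_ratings)  # ~6.35
s2 = statistics.stdev(new_ratings)  # ~6.99
n1 = n2 = 4

# Estimated standard error of the difference between the two averages
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
print(round(se, 2))  # 4.72
```
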

### Take Your Tails

Now that we know the shape and size of the sampling distribution, you can calculate the probability of seeing a -0.25 point difference given a hypothetical true difference of 1.00. Actually, the probability of getting exactly -0.25, or 0.00, or any other single exact number is pretty small because, no matter what, only a small fraction of all possible samples have exactly that difference. Even the chance of getting a sample difference of exactly 1.00 (precisely the hypothetical population difference) is relatively small (though larger than any other single probability), given you’d expect some deviation due to sampling error. Getting 1.00 in your sample would be like getting exactly 500 heads after flipping a coin 1000 times. It would be a big coincidence, in a weird way.

What we actually want to do is calculate the probability of a *range* of values that starts from the observed difference and *includes all values beyond the observed value that would lead to the same design decision*. This will include either all values greater than your observed difference or all values less than your observed difference. For the case of Hypothetical State A, if the observed difference is sufficiently larger than 1.00 (the “who cares?” threshold), then you decide to redesign the site. Since we’re looking for *larger* values, we want the range *greater* than or equal to our observed difference. We will calculate the probability of getting -0.25 or more given a hypothetical true difference of 1.00. This will include all differences from -0.25 through the right end or “tail” of the sampling distribution.

For Hypothetical State B, if the observed difference is sufficiently smaller than 3.00, then you decide to not redesign the site. Since we’re looking for *smaller* values to make a design decision, we want the range *less than* or equal to the observed difference. We will calculate the probability of getting -0.25 or less given the hypothetical true difference of 3.00. This will include all differences from -0.25 to the left tail of the sampling distribution.

That’s the pattern of greater-than and less-than you get when larger values are “better” (i.e., lead you to favor the new design over the old). If you were using a different measure of product performance where smaller values are better (e.g., task completion time), it would be backwards: For Hypothetical State A, you’d want the probability of getting less than or equal to your observed difference, and for Hypothetical State B, you’d want the probability of getting greater than or equal to your observed value.

### The t-test

Here’re the steps for a t-test, applied to Hypothetical State A, the probability of getting a -0.25 point difference or more given the true difference is 1.00.

**Step 1. Calculate your standard error** from the sample standard deviations. Oh, right. We already did that. It’s 4.72.

Right away, you can predict this t-test is not going to give you a small p-value that would lead to a design decision. You’ve an observed difference of -0.25, which is only 1.25 points from Hypothetical State A; that’s less than the standard error of 4.72. Your observed difference is well within the range of how much an observed difference *typically* deviates from the hypothetical value from sample to sample. It has to be pretty plausible that you’d observe -0.25 given a hypothetical true difference of 1.00. But let’s forge on.

**Step 2. Calculate your observed t-statistic**, which represents the deviation of your observation from the hypothetical state in units of the standard error. A *t* of 1 means the observed difference is 1.00 standard errors away from the hypothetical difference (which would be plenty plausible). Mathematically,

*tstat* = (*diff* – *hypo*) / *se*

The numerator of the t-statistic formula captures the deviation between the observed difference (*diff*) and the hypothetical difference (*hypo*). Dividing by the standard error converts your units from the units of the scale (5 to 35 points) to universal units of number of standard errors. This way, we don’t need to custom-make a sampling distribution for every variable you measure and test. We can compare your difference in “t-units” to a single standard t distribution. A t distribution is essentially a normal distribution with an average of 0 and a standard deviation of 1, with a small adjustment to its shape to account for the fact that we only have an estimate of the standard deviation based on a sample of a particular size. Since our sampling distribution is (close enough to) normal and we estimated the standard deviation (the standard error) from a sample of eight users, it’s just the thing we need. Bow down once again to the ancient statistical priests.

So with our data,

*tstat* = (-0.25 – 1.00) / 4.72 = -0.26
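For the record, the same arithmetic in Python (nothing fancy, just the formula above with the numbers plugged in):

```python
# t-statistic for Hypothetical State A: deviation of the observed
# difference from the hypothetical difference, in standard-error units
diff = -0.25   # observed difference in averages (new - old)
hypo_a = 1.00  # Hypothetical State A
se = 4.72      # estimated standard error of the difference

tstat = (diff - hypo_a) / se
print(round(tstat, 2))  # -0.26
```
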

**Step 3. Get the observed p-value for your t-statistic.** If you had stat back when all surfboards qualified as guns, you’d have used printed tables for this. Today, we can use a spreadsheet function. In Excel, it’s TDIST(), which takes as parameters your t-statistic, your degrees of freedom, and the number of “tails.” The degrees of freedom are used to adjust the normal distribution to account for using an estimated standard error. The degrees of freedom for two separate groups of users is:

*df* = *n1* + *n2* – 2

Or 4 + 4 – 2 = 6 in our case.

For now, put in one for the number of tails. We’ll get to two-tailed tests in Stat 202.

Now if you’ve done exactly like I said with Excel, it gives you an error rather than a p-value. Did you do something wrong? No, remember MS’s unofficial slogan: “It’s not a bug, it’s a feature.”

For reasons I cannot fathom, Excel refuses to accept negative t-values, which is very strange since half of all t-values in the t-distribution are negative. It’s like a cashier that only accepts one, ten, and fifty dollar bills. Furthermore, the TDIST() function only gives the p-values for the entered t-value or larger. There’s no flag to pass to tell it you want the p-value for a given t-statistic or smaller.

But it’s not a bug, it’s a feature, because it forces you to draw a diagram of the sampling distribution in order to figure yourself out of this problem. When you do, you might see something surprising.

Okay, draw your sampling distribution. It’s a normalish curve with a standard deviation equal to your standard error (4.72, in this case). The midpoint is equal to your hypothetical difference (1.00 for Hypothetical State A). Mark your observed difference on it (-0.25). Below each number, mark the equivalent in t-units: 0 below the midpoint, 1 below the point one standard error above it, and -0.26 below the observed difference.

Now shade the area that you’re calculating the probability for. For Hypothetical State A, that’s everything at or greater than -0.25, all the way off the right tail of the distribution.

Whoa, is that right? More than half of the curve is shaded. The total probability for the entire curve is 1.00. That is, you have a probability of 1.00 (total certainty) of getting some difference that’s in the sampling distribution. That makes sense: the sampling distribution by definition includes every possible difference from every possible sample. But if most of the sampling distribution is shaded, then the probability of getting a difference of -0.25 points given a population difference of 1.00 point is more than 0.50. Getting a difference of -0.25 or more is not only plausible, it’s *likely* given a real difference of 1.00. There is no way you’re going to get a p-value less than 0.10, or less than any reasonable chance anyone would want to have of making a Type I error. It’s not just an uphill battle. It’s an unscalable wall. This is going to be true whenever your range of values includes the hypothetical state.

**Step 3-and-a-half. Work around the limitations of your spreadsheet functions.** So, to be fair to Microsoft, you don’t need to use TDIST() in this case. But let’s undo the problem Excel caused and get the exact p-value anyway so you know how to do it. To do this, we rely on the fact that the t-distribution is symmetrical. The p-value for -0.26 or greater is equal to the p-value for 0.26 or less. So the solution is to calculate the p-value for 0.26 or less.

Problem: TDIST() doesn’t give the p-value for a given t or less. Fine. We know the entire distribution totals to 1, so we get the p-value for 0.26 or more and then subtract it from 1.

So the p-value for -0.26 or more is equal to 1 – TDIST(ABS(-0.26),6,1). Sheesh. There really isn’t any good way to handle this other than sketching the sampling distribution and figuring it out.

TDIST() for a *tstat* of 0.26, *df* of 6 and 1 tail gives a p-value of 0.400. So the p-value for Hypothetical State A is 1 – 0.400 = 0.600. Just as we suspected, we’re likely to get a difference of -0.25 or more in the sample when the real difference is 1.00.

Okay, so we’ve nothing to statistically compel us to redesign the site. But does that mean we should stick with the old design? Not by itself. It’s time to calculate the probability of getting -0.25 points if Hypothetical State B were true about the population. Now we have the complementary logic: If our observed difference (or less) is implausible assuming Hypothetical State B is true, then we stay with the old design. That is, if there is less than a 0.10 probability of getting no more than -0.25 points for the Definitely Redesign state, then we conclude that the true difference in the population is lower than 3.00 points: it’s at least in the Whatever Zone (where staying with the old design isn’t an appreciably harmful choice to Stan’s business), and may be at or below the Definitely Don’t Redesign level.

For Hypothetical State B, Step 1 is already done, since the standard error isn’t affected by the hypothetical states. Step 2 is:

*tstat* = (-0.25 – 3.00) / 4.72 = -0.69


For Step 3, sketch the sampling distribution:

We’ve another negative t, so we have to use the equivalent point on the other side of the t-distribution to use TDIST().

You see in this case we are getting the p-value for the equivalent t *or more*, so we *don’t* have to subtract the result from 1. The p-value is just TDIST(0.69,6,1), or 0.259.

Hmm. It seems it’s *also* pretty plausible you’d get -0.25 points or less when the true population difference is 3.00. The p-value is higher than 0.10, so now you’ve nothing to compel you to keep the old design. It’s a no-win situation. If you proceed with the new design, you have a 60% chance of making a Type I error and redesigning when you definitely shouldn’t. If you keep the old design, then you have almost a 26% chance of making a Type II error and not redesigning when you should. Basically, you can’t make a reasonably safe design decision one way or the other. Dude, both of your options suck. You’re bobbing in statistical mush going nowhere.
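If you’d rather skip TDIST()’s quirks entirely, you can get both one-tailed p-values by numerically integrating the t distribution’s density yourself. A self-contained Python sketch (no stats library assumed; the density formula is the textbook one, and trapezoid-rule integration is plenty accurate here):

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

def p_upper(t, df, steps=100_000, upper=50.0):
    """P(T >= t): area under the density from t out the right tail."""
    if t < 0:
        # The t distribution is symmetrical: P(T >= t) = 1 - P(T >= -t)
        return 1.0 - p_upper(-t, df, steps, upper)
    h = (upper - t) / steps
    area = 0.5 * (t_pdf(t, df) + t_pdf(upper, df))
    for i in range(1, steps):
        area += t_pdf(t + i * h, df)
    return area * h

df = 4 + 4 - 2   # two groups of four users
se = 4.72
diff = -0.25

# Hypothetical State A: probability of the observed difference OR MORE,
# given a true difference of 1.00 (shade from -0.25 out the right tail).
t_a = (diff - 1.00) / se      # ~ -0.26
p_a = p_upper(t_a, df)        # ~0.60, matching 1 - TDIST(0.26, 6, 1)

# Hypothetical State B: probability of the observed difference OR LESS,
# given a true difference of 3.00 (shade from -0.25 out the left tail).
t_b = (diff - 3.00) / se      # ~ -0.69
p_b = 1.0 - p_upper(t_b, df)  # ~0.26, matching TDIST(0.69, 6, 1)

print(round(p_a, 2), round(p_b, 2))
```

To two decimal places, these match the Excel results for both hypothetical states, and the `t < 0` branch handles the negative t-values Excel refuses outright.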

### Addressing Ambiguous Results

#### More Design

In this particular case, you may want to offer to Stan that you work to improve the new design for no additional cost to him. Given your observed difference in the averages is within a standard error of Hypothetical State A, you know it’s quite plausible this -0.25 you’re seeing is sampling error. However, the fact that the users rated the new design as worse than the old means it’s more likely the new design really is worse than the old than vice versa. It may not be *much* more likely, but it’s still somewhat more likely. Maybe you should make the design better before subjecting it to more summative testing.

On the other hand, if the new design already represents your best effort, and there are no hints from the usability test on improving the new design, maybe your business is better off testing more users now than wasting time blindly trying to improve what you have. If the new data forces you to conclude the new design is worse than the old, *then* you can offer to try to improve the design some more. Or you can walk away. Maybe this just isn’t the project for you.

This course of action only applies when the new design appears to be doing much worse than intended, like here where it’s doing worse than the old design. If the new design were doing, say, 3 points *better* on average than the old, you’d still be getting pretty big p-values given the hypothetical states (go ahead and calculate them assuming the same standard error), but it would be encouraging: the sample’s performance would be right on the Definitely Worth It threshold. It would seem you only need a bigger sample size to be convinced it’s real.

#### More Data

That’s the other course of action. You could collect more data. We saw in Stat 101 how larger sample sizes mean smaller Type I and Type II error rates for a given level of user performance. That’s the whole reason why you prefer large sample sizes when you can afford them. Larger sample sizes increase what I call the *decision potential* of your usability test: your ability to make a decision at acceptable Type I and Type II error rates. Statisticians loosely talk about larger sample sizes increasing statistical power (that is, one minus the Type II error rate), but that’s only because in scientific work the Type I error rate is traditionally fixed at 0.05, so increasing the sample size will only change the Type II error rate. However, emphasizing only the impact on power masks the fact that bigger sample sizes also help you make decisions regarding Hypothetical State A as well as Hypothetical State B. For a given Type I error rate (or level of statistical significance), bigger samples mean you can confidently decide to redesign with a sample user performance closer to Hypothetical State A. They also mean you can confidently decide *not* to redesign with a sample performance closer to Hypothetical State B. Bigger samples give you a smaller in-between range of performance where you can’t decide either way.

You can see the role of sample size on decision potential in the t-statistic formula. The t-statistic represents the deviation of your observation from the hypothetical state in units of the standard error. To get a lower p-value for either hypothetical state, you need a larger t-statistic: a greater deviation. Looking at the formula for the t-statistic, you see you can’t do anything to increase the numerator to get a bigger t-statistic; at least, not anything ethical. You can hope that collecting more data will shift the difference in averages to something more favorable for your design, but you can’t make it happen. It’s going to be what it’s going to be (Whoa, says Stan, statistics are *deep*, bro).

But you can do something to shrink the denominator and thus get a bigger t-statistic. The formula for the standard error shows that gathering more data directly reduces it. Make either or both group sizes bigger (bigger *n1* and/or *n2*), and the standard error gets smaller. That’s just another way of saying what you already know intuitively: the bigger the sample, the more accurate the statistics that come from it, and the less your observed difference in the averages will tend to deviate from the real one. Since the t-tests for Hypothetical States A and B use the same standard error, a smaller standard error increases the t-statistic for both. Bigger t-statistics mean greater decision potential.

### Estimating Needed Sample Size

To summarize: bigger sample size means smaller standard error means bigger t-statistics mean smaller p-values mean you can make a design decision at acceptable Type I and Type II error rates. Not only do you know that increasing the sample size will get you to a design decision, but, because you have standard deviations for each group, you can estimate *how much bigger* your sample needs to be. Surfer Stan had told you that he prefers his Type I and Type II error rates be kept around 0.10. Play with Excel’s TDIST() function a little, and you’ll find you need a t-statistic of about 1.3 to get a p-value of 0.10 with larger sample sizes (and therefore more degrees of freedom). A little algebra tells us:

*tstat* = (*diff* – *hypo*) / *se*

*needed se* = (*diff* – *hypo*) / *tstat*

In the case of Surfer Stan, and Hypothetical State B:

*needed se* = |-0.25 – 3.00| / 1.3 = 2.50

That is, assuming the difference in the averages remains the same at -0.25, you’ll be able to conclude the new design is not considerably better than the old if you can get the standard error down from 4.72 to 2.50. In other words, you need to cut the standard error almost in half. Normally, you can use the same formula for Hypothetical State A to see how much smaller the standard error has to be to conclude the new design is better than the “who cares?” level of performance, but in this case, with the new design performing worse than the old, there’s no such number: when your shaded region of the sampling distribution passes through the hypothetical value, your p-value will always be greater than 0.500.

Now let’s figure out how many more users you need to run to reduce the standard error to 2.50. Assuming you have equal numbers of users trying each design (and it’s generally a good idea to try to accomplish that) then the standard error is inversely proportional to the square root of your sample size. Algebraically:

*se* = SQRT(*s1*^2/*n1* + *s2*^2/*n2*)

Given *n1* = *n2*, we’ll just call the sample size of each group *ng*, so *ng* = *n1* = *n2*:

*se* = SQRT(*s1*^2/*ng* + *s2*^2/*ng*)

*se* = SQRT( (*s1*^2 + *s2*^2) / *ng* )

*se* = SQRT(*s1*^2 + *s2*^2) / SQRT(*ng*)


If the size of the standard error is inversely related to the square root of your sample size, then the ratio of your current standard error over your needed standard error is equal to the ratio of the square root of your needed sample size over the square root of your current sample size. Mathematically:

(*se current* / *se needed*) = SQRT(*ng needed*) / SQRT(*ng current*)

More algebra:

*ng needed* = (*se current* / *se needed*)^2 * *ng current*

You currently have 4 users in each group (*ng* = 4). Given a standard error of 4.72, and a needed standard error of 2.50:

*ng needed* = (4.72 / 2.50)^2 * 4 = 14.26
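The same estimate, as a short Python sketch (the variable names are mine, not from the post):

```python
# Estimate how many users per group are needed to shrink the standard
# error enough to reach p = 0.10 for Hypothetical State B.
diff, hypo = -0.25, 3.00   # observed difference; Hypothetical State B
tstat_needed = 1.3         # t-statistic giving p of about 0.10
se_current, ng_current = 4.72, 4

se_needed = abs(diff - hypo) / tstat_needed              # 2.50
ng_needed = (se_current / se_needed) ** 2 * ng_current   # about 14.3
```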

Conclusions:

- You need an estimated 14 users (rounded) per group, which means,
- Since you have 4 users so far, you need 14 – 4 = 10 more users per group, which means,
- You need to find a total of 2 * 10 = 20 more users, which means,
- You’ve a lot more usability testing to do, which means,
- You can forget about spending two days of this business trip lying on the beach, conducting “ethnographic research.”

Going from a sample of eight to 28 seems like a lot more work than should be necessary (especially just to convince yourself the new design won’t be substantially better than the old). However, remember the standard error varies with the inverse of the *square root* of the sample size. If you need to cut the standard error in half, you need to *quadruple* your sample size. Even then, running 20 more users does not guarantee you’ll be able to make a go/no-go decision. It’s an estimate assuming the difference between the averages remains at -0.25 and the standard deviations remain the same. They probably won’t, of course, because there’s sampling error. But look at the bright side: running a nice big sample also gives you the best chance of finding out that the new design is actually better than the old, if it really is better. Text me when you’re done. I’ll be hitting the waves.

### The t-test Redux

A day or two later, here’s data from 20 more users appended onto the original data, along with the revised statistics:

**Data**

| User | Old Home Page Rating | User | New Home Page Rating |
|------|------|------|------|
| 1 | 23 | 2 | 26 |
| 3 | 20 | 4 | 29 |
| 5 | 31 | 6 | 21 |
| 7 | 16 | 8 | 13 |
| 9 | 28 | 10 | 26 |
| 11 | 20 | 12 | 21 |
| 13 | 25 | 14 | 31 |
| 15 | 17 | 16 | 29 |
| 17 | 18 | 18 | 27 |
| 19 | 21 | 20 | 26 |
| 21 | 17 | 22 | 29 |
| 23 | 16 | 24 | 33 |
| 25 | 35 | 26 | 23 |
| 27 | 15 | 28 | 29 |

**Statistics**

| Statistic | Old Home Page | New Home Page | Both Pages |
|------|------|------|------|
| Sample | 14 | 14 | 28 |
| Average | 21.57 | 25.93 | 23.75 |
| Std Dev | 6.14 | 5.11 | 5.97 |
| Skew | 1.05 | -1.23 | -0.02 |
| Difference in averages | | 4.36 | |

Now we’ve got a different story to tell: The new page scored 4.36 points better than the old. Let’s run a t-test on Hypothetical State A.

Step 1: Standard Error

*se* = SQRT(*s1*^2/*n1* + *s2*^2/*n2*)


*se* = SQRT( 6.14^2 / 14 + 5.11^2 / 14 ) = 2.13

Cool. Thanks to small changes in the standard deviations, our standard error came out a little lower than the 2.50 we hoped to get by adding 20 more users.

Step 2: t-statistic

*tstat* = (*diff* – *hypo*) / *se*

*tstat* = (4.36 – 1.00) / 2.13 = 1.57

With a bigger difference in the averages and a smaller standard error, it’s no surprise the t-statistic is substantially larger.

Step 3 through 3-and-a-half: p-value

We sketch our sampling distribution. We have a positive t-statistic and we want to know the probability of getting a difference of 4.36 or greater, so it looks like this:

For once, we don’t have to go through any contortions to use TDIST(). Degrees of freedom are now:

*df* = *n1* +* n2* – 2

*df* = 14 + 14 – 2 = 26

And so:

*p* = TDIST(1.57, 26, 1) = 0.064
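If you prefer scripting to spreadsheets, the whole t-test can be run in a few lines of Python. This is a sketch assuming SciPy is installed; `t.sf` gives the one-tailed upper-tail probability, playing the role of TDIST(*tstat*, *df*, 1):

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t  # assumes SciPy is installed

old = [23, 20, 31, 16, 28, 20, 25, 17, 18, 21, 17, 16, 35, 15]
new = [26, 29, 21, 13, 26, 21, 31, 29, 27, 26, 29, 33, 23, 29]

hypo = 1.00  # Hypothetical State A, the "who cares?" difference
diff = mean(new) - mean(old)  # about 4.36
se = sqrt(stdev(old) ** 2 / len(old) + stdev(new) ** 2 / len(new))  # about 2.13
tstat = (diff - hypo) / se
df = len(old) + len(new) - 2  # 26
p = t.sf(tstat, df)  # one-tailed p-value, like TDIST(tstat, df, 1)
```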

The p-value is 0.064. It’s pretty implausible that the new design is insufficiently better than the old. Maybe not the most implausible thing you’ve ever faced, but implausible enough that Stan believes giving you the contract for the entire new site is a good business decision with tolerable risk. It seems the initial result from the sample of eight users was just sampling error. Lucky.

You can run the t-test for Hypothetical State B if you want, but you should already be able to tell it’ll have a pretty big p-value since the observed difference is within one standard error (the new smaller standard error) of the hypothetical value. In fact, since the observed difference (4.36) is greater than the hypothetical state (3.00), and we’re testing for the probability of getting the observed difference or smaller, you know the p-value will be over 0.500. It’s a mirror image of the situation we had with the initial sample and Hypothetical State A. However, at this stage it doesn’t matter what the p-value is. If the p-value were very low, then you’d be pretty sure there isn’t a *definite* advantage with the new design, but you’d still be pretty sure it is at least better than the “who cares?” level of performance. You’d conclude you’re in the Whatever Zone, to which Stan would likely say, “whatever,” and give you the contract, especially after already sinking some money into the test home page.

So congratulations. On to more designing, more usability testing, and more statistics. Whoo!

### Summary Flow Chart

#### Problem: Comparing averages to determine if a certain design is truly better than another.

**Solution**: Follow the flow chart below. Consider it provisional, because there are other issues to address that we’ll cover in Stat 202.

- With your client, set your Hypothetical States A and B.
- Conduct a t-test for Hypothetical State A:
  - Calculate your standard error.
  - Calculate the t-statistic.
  - Determine the p-value with something like Excel’s TDIST() function.
- If the p-value is sufficiently low to represent a tolerable Type I error rate, proceed with the new design.
- If the p-value represents an excessive Type I error rate, conduct a t-test for Hypothetical State B.
- If the p-value represents a tolerable Type II error rate, do not proceed with the new design.
- If the p-value represents an excessive Type II error rate, calculate the increase in sample size you need to get a low p-value for either Hypothetical State A or B.
- Increase your sample size.
- Repeat.

Here’s an Excel sheet with this post’s data and analysis.

### Update 4/14/2012

#### Precisely Calculating Degrees of Freedom

The simple formula for degrees of freedom, *df* = *n1* + *n2* – 2, is good enough for most usability testing situations. However, if there are big differences in the standard deviations of your two groups of users, then you need to make a complicated adjustment. As a rule of thumb, you should calculate the adjustment if one standard deviation is at least twice the size of the other. For example, you’d make the adjustment if, after running 20 more users, the new home page had a standard deviation of, say, 12.00 while the old home page remained at 6.14.

How complicated is the adjustment? Well, first let’s define the error variances, *v1* and *v2*, for each group as:

*v1* = *s1*^2 / *n1*

*v2* = *s2*^2 / *n2*

The adjusted degrees of freedom are then:

*df* = (*v1* + *v2*)^2 / (*v1*^2 / ( *n1* – 1) + *v2*^2 / (*n2* – 1))

When using TDIST(), round the result to the nearest integer. TDIST() accepts only integers for the degrees of freedom and will truncate any decimal number, so rounding is more accurate.
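The adjustment is easy to script. Here’s a Python sketch (the function name is mine):

```python
def welch_df(s1, n1, s2, n2):
    # Adjusted degrees of freedom from the error variances of each group
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Similar standard deviations: close to the simple df of 26
print(round(welch_df(6.14, 14, 5.11, 14)))   # 25
# Very different standard deviations: df drops noticeably
print(round(welch_df(6.14, 14, 12.00, 14)))  # 19
```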

#### Impact of Using the Adjusted Degrees of Freedom

When the group standard deviations are about the same, the results are essentially equal to what you get with the simple *df* formula. For example, with the standard deviations of 6.14 and 5.11 that we got with 28 users:

*v1* = 6.14^2 / 14 = 2.69

*v2* = 5.11^2 / 14 = 1.86

*df* = (2.69 + 1.86)^2 / (2.69^2 / ( 14 – 1) + 1.86^2 / (14 – 1)) = 25.17

We round 25.17 to 25, which yields the same p-value (rounded to three places) as we got with df = 26 earlier:

TDIST(1.57, 25, 1) = 0.064

A few degrees of freedom here or there don’t make much difference when the total sample size is about 30 or more.

However, if the standard deviations were very different, such as 6.14 and 12.00:

*v1* = 6.14^2 / 14 = 2.69

*v2* = 12.00^2 / 14 = 10.29

*df* = (2.69 + 10.29)^2 / (2.69^2 / ( 14 – 1) + 10.29^2 / (14 – 1)) = 19.36

Rounding to 19,

TDIST(1.57, 19, 1) = 0.066

Okay, it *still* doesn’t make much difference, but it’s good we played it safe and made the adjustment. The smaller the sample sizes, the bigger the difference. For example, if there were 4 users per group, the p-value goes from 0.083 with the simple formula to 0.095 with the complicated adjustment (the adjustment can only increase p-values).

So, you’re generally safe to use the simple formula for degrees of freedom in usability testing. On the other hand, you can do no wrong using the complicated adjustment. Just to be extra-anal, I’ve updated the spreadsheet with this post’s examples to use the adjustment.

#### The Significance of Very Different Standard Deviations

Parenthetically, a large difference in your standard deviations may itself be a significant finding, in both senses of the word. For a simple way to get the p-value for the difference of two sample standard deviations, read up on the F-test. A higher standard deviation in one group indicates those users are comparatively polarized. For example, if the new home page had a standard deviation of 12.00, it would suggest that, relative to the old home page, users tended to either love or hate the new home page. That could have design or deployment implications.

#### What’s This All About, Anyway?

If you’re leafing through your intro stat textbook trying to figure out where all this is coming from, it’s this: the procedure I’ve outlined in this post is a “t-test for separate variance estimates,” in contrast to the “t-test for a pooled variance estimate” that most textbooks present. The t-test for a pooled variance estimate assumes the two groups have the same population standard deviation and any difference you see in the sample standard deviations is sampling error. You then estimate the population standard deviation with essentially a weighted average of the two sample standard deviations.

However, my philosophy is don’t assume anything you don’t have to assume. Usually, it’s pretty reasonable to assume the two groups have the same population standard deviation, but why take the chance? The t-test with separate variance estimates is always the safe option. It gives the same p-value as the t-test for a pooled variance estimate when your two groups have the same number of users and there’s little difference in the sample standard deviations, both of which are usually true in usability testing. However, it protects you in case any differences in the sample standard deviations are, in fact, reflected in the population. So you lose nothing but gain peace of mind by using the t-test for separate variance estimates. The only drawback is the more complicated adjusted degrees of freedom calculation, which is important to do when sample standard deviations are *very* different and sample sizes are small.
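You can see the two flavors side by side with SciPy, if it’s installed. One caveat: SciPy’s `ttest_ind` tests against a hypothetical difference of zero, two-tailed, rather than our Hypothetical State A, but it illustrates how close pooled and separate variance estimates come out with equal group sizes:

```python
from scipy.stats import ttest_ind  # assumes SciPy is installed

old = [23, 20, 31, 16, 28, 20, 25, 17, 18, 21, 17, 16, 35, 15]
new = [26, 29, 21, 13, 26, 21, 31, 29, 27, 26, 29, 33, 23, 29]

# Pooled variance estimate: assumes equal population standard deviations
t_pooled, p_pooled = ttest_ind(new, old, equal_var=True)

# Separate variance estimates (Welch's t-test): the safe option
t_welch, p_welch = ttest_ind(new, old, equal_var=False)
```

With equal group sizes, the two t-statistics are identical; only the degrees of freedom (and hence the p-values) differ, and here only slightly.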

#### No, What’s This *Really* All About?

I forgot about adjusting the degrees of freedom. Hey, you can’t believe everything on the web.