Comparing Averages, Part 2
- Comparing two new proposed designs.
- Two-tailed versus one-tailed t-tests.
- Maximizing power with experimental design.
- Within-subject testing.
- More on the normality assumption.
- Data transformations.
Yes! You got the massive Surfer Stan web site contract. Now you’re truckin’.
The biggest challenge for the site redesign is Paint Your Board, a new feature for Stan’s site. Stan has invested in a computer numerically controlled (CNC) four-color surfboard painter. Stan got it so users can upload any image they want to appear on the boards they’ve chosen for purchase. You’ve made two interactive prototype UIs for Paint Your Board. One is an easy-to-understand but awkward-to-use multi-page HTML wizard, where users select the image’s size, position, orientation, and distortion on their boards through a combination of input fields and Submit buttons. Early testing showed this requires a lot of trial-and-error and page loads to get the results the users desire. It’s all very time-consuming.
The second prototype is a Flash application where a user drag-and-drops the chosen images over the board and sees the result instantly. In theory this will be several times faster than the HTML approach (not to mention being way rad). However, you know that drag and drop isn’t terribly discoverable, so the question is, will the improved speed of Flash make up for the time it takes to learn it? Time for another usability test. This time, the question is, which is faster on average, HTML or Flash?
City of Two Tails
Hypothetical State A
Let’s select our Hypothetical State A. What true difference in times is definitely not worth favoring one design over another? This is a different situation from last time, when we compared an existing home page with a proposed new one. Before, you already had a working web site and were trying to see if it was worth the cost of redesigning it. Now both designs are equally complete and at least partially operational (to do the usability test, you had to prototype both designs). However hard it was to make each version, that money and time are already spent. Assuming the future differences in development, installation, and maintenance costs are negligible, there is no additional cost associated with going with one design over the other. So any difference is worth favoring one design over another. You have to choose one of them. If you can conclude that one is even a tiny smidgen better than the other, then that’s the one to go for. So:
Hypothetical State A: The difference in the completion times is zero.
If we observe a difference that is implausible assuming the real difference is zero then we can confidently favor one design over the other. Significantly (no pun intended), we will favor one design over the other for any implausible difference, either HTML being better than Flash or vice versa.
Last time with the home page designs, we only considered if the new home page was better than the old home page. We never checked to see if the old design was sufficiently better than the new design. We calculated p-values for Hypothetical State B to see how sure we could be that the new design was not considerably better than the old design, but that didn’t tell us if the old design could be considerably better than the new. It made sense to only consider the new being better than the old because it was really irrelevant if the old design is better than the new one. I mean, who cares? As long as you’re reasonably sure the new design wasn’t considerably better than the old, you were not going to go with the new design. Concluding that the old design is better than the new didn’t add useful information to your decision-making process.
Now we are considering either difference, simultaneously both HTML being better than Flash and Flash being better than HTML. When we calculate the p-value, it’s not a question of the probability of getting an observed difference so far above Hypothetical State A. It’s not a question of the probability of getting an observed difference so far below Hypothetical State A. We need the probability of getting an observed difference so far on either side of Hypothetical State A. We need the probability of a difference in completion times as extreme as or more extreme than what we observe.
For example, suppose we get a difference in completion time of 50 seconds. Let’s say the standard error is 25 seconds. Our p-value for that t-statistic should represent the probability of getting either 50 seconds or higher above zero, or 50 seconds or lower below zero (that is, -50 seconds or lower). The t-statistic, which converts our deviation from the hypothetical state from seconds to number of standard errors, is:
tstat = (diff – hypo) / se
Hypothetical State A is zero difference, so the corresponding t-statistics would be:
tstat = (50 – 0) / 25 = 2.0 for 50 seconds and higher,
tstat = (-50 – 0) / 25 = -2.0 for -50 seconds and lower.
So in units of standard error, we want the p-value of getting 2.0 or higher and -2.0 or lower. The sampling distribution would look like this:
This is called a two-tailed test because there are time-differences in each tail of the sampling distribution that would represent Type I errors that we need to include in our probability. We say we’ll favor one design if the difference is implausible assuming Hypothetical State A is true, but that really represents two ways to make a Type I error, that is, two ways to favor a design when there is really no difference. So we have to include the probabilities of both extremes to get the total probability of getting a Type I error. We do this by simply adding the probabilities from both tails together. We’ll add the probability of getting a t-statistic of 2.0 or more to the probability of getting a t-statistic of -2.0 or less. Or, because the t-distribution is symmetrical, we’ll double the probability of getting 2.0 or more (especially since Excel refuses to take negative t-statistics).
We can get the p-value of our t-statistic using Excel’s TDIST() function, which requires the t-statistic, the degrees of freedom, and the number of tails, in that order. Let’s say our degrees of freedom (which are directly related to the sample size) is 10. TDIST(2.0, 10, 1) = 0.037 is the probability in one tail, so the probability in two tails is double that, or 0.073 (the unrounded one-tail value is 0.0367, which is why doubling doesn’t give 0.074).
Or, we simply tell Excel we want to do a 2-tailed t-test, and Excel will double the p-value for us: TDIST(2.0, 10, 2) = 0.073.
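If you'd rather check this outside Excel, here is a minimal Python sketch of the same calculation. The helper names t_pdf and t_tail are my own inventions, not library functions, and the tail area comes from a simple numerical integration (Simpson's rule) rather than whatever Excel does internally, so treat it as an approximation that should land close to the figures above.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, tails=2, steps=2000):
    """P-value for a t-statistic, in the spirit of TDIST(ABS(t), df, tails).

    Integrates the density from 0 to |t| with Simpson's rule (steps must be
    even), subtracts that area from the 0.5 that sits on one side of zero to
    get one tail, then multiplies by the number of tails.
    """
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    one_tail = 0.5 - total * h / 3
    return tails * one_tail

diff, hypo, se, df = 50.0, 0.0, 25.0, 10   # the example values from the text
tstat = (diff - hypo) / se                  # deviation in standard errors
p_one = t_tail(tstat, df, tails=1)
p_two = t_tail(tstat, df, tails=2)
print(f"t = {tstat:.1f}, one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

Because t_tail takes the absolute value itself, it accepts a negative t-statistic without complaint, which saves you the ABS() step Excel needs.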
Hypothetical State B
You may wonder if we even need Hypothetical State B since two-tailed p-values for Hypothetical State A can tell us both if we should choose Flash over HTML or if we should choose HTML over Flash. We still need Hypothetical State B so we can tell with high certainty if there is no considerable difference between Flash and HTML, as opposed to failing to detect a potentially important difference because your sample size is too small. By “considerable” in this case, I mean so large we should definitely choose one design over the other.
So now you and Stan set Hypothetical State B. What true difference in times is definitely worth favoring one design over another? Talking it over with Stan, you decide a 20% shorter completion time or better. Or to put it another way, if the faster version takes 80% or less of the time of the other, it’s definitely the design to go for. For example, that’s the case if one design takes five minutes and the other takes four.
Stan’s reasoning is that 20% less time to do something is enough to be noticeable, so that’s definitely worth favoring one design over the other. You could’ve agreed on a certain number of minutes or seconds, but, not knowing how long the Paint Your Board task takes (it being a new feature), Stan couldn’t get tight with such numbers. Besides, having Hypothetical State B as a ratio is going to save us some arithmetic complications, as it turns out.
We have two tails for Hypothetical State B too, since we need to calculate the p-value of any difference assuming the real difference is 20%; that is, HTML-over-Flash or Flash-over-HTML being 80%.
When You Need a Two-Tailed Test
Doing a two-tailed t-test has nothing to do with completion time being our measure. You can do a two-tailed test for ratings or any other measure.
What matters is whether you’re preparing to check for one or two possible outcomes from each hypothetical state. If you are strictly interested in the possibility of Design X being better than Design Y, it’s a one-tailed test. If you are even remotely interested in Design Y being better than Design X in addition to Design X being better than Design Y, then it’s a two-tailed test.
It’s important to firmly decide whether you want to do a one-tailed or two-tailed test because each has advantages and disadvantages, and you can’t change your mind after you see how the users perform.
The advantage of the two-tailed test is that it gives you more information: for Hypothetical State A alone, it’ll tell you if X is better than Y and if Y is better than X. The one-tailed procedure can only tell you if X is better than Y. However, the one-tailed test has more power or “decision potential,” as I like to say. You’re focusing all your resources into only one possible outcome per Hypothetical State, getting more out of it. Because a one-tailed procedure doesn’t split our Type I and Type II error probabilities each into two pieces, you can make conclusions with less extreme t-statistics. For example, with a one-tailed test you need a t-statistic of about 1.3 (the exact value depending on your degrees of freedom) to have a 0.10 probability of an error. With a two-tailed test, you’d need a t-statistic of about 1.7 to have the same probability.
A less extreme t-statistic can come from smaller sample sizes (i.e., bigger standard errors), so you can make decisions with faster and cheaper usability testing.
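To see where the "about 1.3" and "about 1.7" come from, here is a rough Python sketch that hunts for the smallest t-statistic giving a 0.10 p-value, one-tailed and two-tailed, at a few degrees of freedom. The functions t_pdf, t_tail, and critical_t are names I made up for illustration; the search is a plain bisection over a numerically integrated t-distribution, so the results are approximations.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, tails=2, steps=400):
    """Approximate p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return tails * (0.5 - total * h / 3)

def critical_t(alpha, df, tails):
    """Smallest |t| whose p-value is at or below alpha, found by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_tail(mid, df, tails) > alpha:
            lo = mid      # p still too big: need a more extreme t
        else:
            hi = mid      # p small enough: try a less extreme t
    return hi

for df in (9, 17, 29):
    one = critical_t(0.10, df, tails=1)
    two = critical_t(0.10, df, tails=2)
    print(f"df={df}: one-tailed needs t >= {one:.2f}, two-tailed needs |t| >= {two:.2f}")
```

The gap between the two columns is the price of checking both directions at once.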
Given that usability tests already tend to have small sample sizes, the greater decision potential of a one-tailed test is a major advantage. However, you have to decide to use a one-tailed test before you see which design is on average better than the other in your sample. If you wait and see that Flash is faster than HTML, then decide, “well, I’ll do a one-tailed test on Flash being better than HTML,” you’re really doing a two-tailed test with double your apparent Type I error rate, because if HTML came out faster than Flash you would have done precisely the opposite. It’s like betting “heads” if a coin-toss comes up tails, and betting “tails” if it comes up heads. You’re doubling your chance of losing from 0.5 to 1.0.
Likewise, you can’t start out planning a one-tailed test then switch over to two-tailed once you see the data going opposite of your one-tailed test. Given Stan can tolerate a 0.10 chance of an error, switching from a one-tailed to a two-tailed test after seeing the averages means having a 0.10 Type I error rate if the data had gone as expected, plus a 0.05 Type I error rate for how the data actually went, for a 0.15 total Type I error rate. Bummer for Stan.
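If the arithmetic on peeking feels slippery, a quick simulation can make it concrete. The sketch below is only an illustration under simplifying assumptions: it draws the test statistic from a standard normal distribution (a large-sample stand-in for the t-distribution) with Hypothetical State A true, and counts how often each "decide after peeking" strategy favors a design anyway. The variable names and the seed are arbitrary choices of mine.

```python
import random
from statistics import NormalDist

rng = random.Random(42)
z = NormalDist()
alpha = 0.10
crit_one = z.inv_cdf(1 - alpha)       # one-tailed cutoff at 0.10
crit_two = z.inv_cdf(1 - alpha / 2)   # two-tailed cutoff at 0.10 (0.05 per tail)

trials = 200_000
peek_errors = 0     # pick the one-tailed direction after seeing the data
switch_errors = 0   # plan one-tailed (positive side), go two-tailed if it comes out negative
for _ in range(trials):
    stat = rng.gauss(0.0, 1.0)        # test statistic when there is truly no difference
    if abs(stat) >= crit_one:
        peek_errors += 1
    if stat >= crit_one or stat <= -crit_two:
        switch_errors += 1

peek_rate = peek_errors / trials
switch_rate = switch_errors / trials
print(f"pick the tail after peeking:        {peek_rate:.3f}")
print(f"switch to two-tailed after peeking: {switch_rate:.3f}")
```

The first rate should hover around 0.20 and the second around 0.15, both worse than the 0.10 Stan signed up for.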
The bottom-line is use the one-tailed procedure when there is some added cost associated with going with Design X over Design Y, such as we had in the rating scale situation, where going with Design X (the new one) meant Stan spending duckets he wouldn’t have to otherwise. In other words, use a one-tailed test when Design Y being better than Design X is practically equivalent to Design X not being considerably better than Design Y.
Decision Potential in Usability Testing
It’s a pretty straightforward usability test: get the average difference in time it takes to virtually paint a board with each prototype. However, you learned your lesson from the test with the ratings. You don’t want to once again be in a situation where both hypothetical states were reasonably plausible given your initial small sample size.
That meant you couldn’t make any design decision with adequate confidence. We want to maximize our decision potential (or statistical power, broadly speaking) to stay out of the statistical mush -of being unable to make a decision at acceptable Type I and Type II error rates. For a given usability test, we want to have the lowest Type I and Type II error rates regardless of the difference we see in the sample. Going back through Stat 101 and 201, we’ve already touched on three ways to help decision potential:
Large sample size. The larger the sample sizes the more accurate the statistics from the sample; that is, the smaller their sample error. The smaller the sample error, the higher your confidence that any deviation from either hypothetical state represents the true population values. In fact, as we saw in Stat 201, if we know the averages and standard deviations to expect from the sample, we can calculate the sample size we need to have reasonably good decision potential.
In the case of these completion times, you don’t have estimates of the standard deviations so you can’t estimate the sample size you’ll need. Earlier testing sessions of previous design iterations included a lot of stopping the task to talk with the users to get qualitative feedback, so it didn’t make sense to try to time the users. But here are some guidelines on choosing a sample size when you don’t have those statistics: Sample sizes of 5 to 10 users are generally only adequate when you expect one design to blow the other out of the water - like one design having half the completion time of the other, a result you can see coming in earlier design iterations even allowing for imperfect measurement. However, your sense from this formative testing is that the difference between HTML and Flash is going to be close. Alternatively, small sample sizes are adequate when there is very little difference between the two designs and your Hypothetical State B specifies a radical difference in performance. If you’d need one design to take half as long as the other to be definitely worth it, then 5 to 10 users would be a good starting point. You and Stan, however, see 80% to be definitely worth it - not that big of a difference in completion time. For our example, go big: start with 20 users and be prepared to add more users if you get mushed.
High-powered statistical test. Some statistical procedures have more decision potential than others. As a general guide, the more information you use from the data and testing process, the greater the decision potential. For comparing averages, the t-test has just about the most decision potential of any test you’re going to do on a spreadsheet, so we’re already doing the best we can in this regard. Simple alternatives to the t-test use less information in the data. For example, tests like the Mann-Whitney U and Wilcoxon Signed-rank use the rank order of the completion times rather than the actual completion time to the nearest second. Other tests, like Chi-square or the sign test simply classify the completion times as relatively “slower,” and “faster.” Anytime you reduce your data in this way, you’re losing decision potential. The main reason to consider such alternatives to the t-test is if either the data is already reduced (e.g., that’s how you got it from the client), or if the normality assumption cannot be met.
One-tailed Test. As we just covered above, a one-tailed test has greater decision potential than a two-tailed test. However, we’re interested in either Paint Your Board design being faster than the other (or either design not being considerably faster than the other), so we’re doing a two-tailed test, which makes it all the more important to compensate for the loss in decision potential.
Let’s look more closely at the t-test and see if there’s something else we can do. To get high decision potential, you need large t-statistics for either Hypothetical State A or B. The numerator for the t-statistic represents the deviation of the sample from the hypothetical state, and that’s going to be what that’s going to be. The denominator of the t-statistic is the standard error. You’ve got to keep your standard error down. Big sample sizes do that, of course, but they’re expensive and time consuming so you don’t want them to be any bigger than they have to be - you remember quite well how you missed a couple beautiful days at the beach because you had to nearly quadruple your sample size for the ratings test in Stat 201. What else can you do?
The only variables other than sample size in the formula for the standard error are the sample standard deviations. You need to make them as small as you can. You want to minimize the variability in the numbers you conduct your t-test on. There are several ways of doing this:
More similar users. If all your users had the same level of skill and knowledge, that would reduce variability. This is why scientists do experiments on genetically inbred white rats raised in controlled environments - they’re all very much alike in whatever rat talents and skills they have, so that minimizes variability, allowing smaller sample sizes. However, limiting the range of skill and knowledge of your users would make your sample unrepresentative of your population of users. Heck, run a bunch of white rats on your Paint Your Board feature, and they’ll all do equally well (poorly), but it won’t mean anything for your users. What you can do is more carefully select your users to make sure you aren’t getting any who are truly unrepresentative of your users. For example, you want to be sure not to include any total kooks who keep confusing the top of the surfboard with the bottom (“which way do the fins go again?”), adding to their task completion time.
More tightly controlled task. You get less variability if you focus your usability test task on precisely the thing you’re testing, and exclude all else. For example, the UI to upload the image for the board is the same for both the HTML and Flash version, so don’t start the stopwatch until after the users upload their image. Otherwise users who get lost in their own file hierarchies have slow completion times relative to those who rip through their folders. Including the image-uploading won’t change the difference in the average completion times because by random chance you can expect an equal number of hierarchy-lost users in each group, so they cancel out. However, it increases the variability of the completion times. Likewise you want to eliminate other sources of extraneous variation. Make sure your instructions are clear and consistent for all users so you don’t have any taking unusually long times because they misunderstood something. Make sure there aren’t any interruptions once the task is started. Keep the users on task no matter how much they want to wander off and explore other features of the site.
Within Subjects Testing. You substantially reduce variability by having each user try both designs and look at the change in completion time of each user. This is called a “within subjects” test (or a “repeated measures” experiment) because you’re varying the UI design within each user or “subject.” The alternative is “between-subjects” testing, which is what we had with the ratings in Stat 201. There you varied the UI design between two separate groups of users. Within-subjects testing has the effect of factoring out individual average differences, reducing variability. Users who tend to be fast on both prototypes are no longer so different from users who are slow on both prototypes because we now use the change in completion times.
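Here is a small simulation sketch of why difference scores are less variable than raw times. Every number in it is made up for illustration (a typical time of 300 seconds, an 80-second spread between users, 30 seconds of task-to-task noise, a 15-second Flash advantage); none of it comes from Stan's test. The between-subjects standard error uses the usual two-group formula, the square root of the sum of each group's variance divided by its size.

```python
import math
import random
import statistics

rng = random.Random(7)
n = 1000                       # lots of simulated users so the pattern is stable
html, flash = [], []
for _ in range(n):
    base = rng.gauss(300, 80)                   # this user's own typical speed (made up)
    html.append(base + rng.gauss(0, 30))        # HTML time = user speed + task noise
    flash.append(base - 15 + rng.gauss(0, 30))  # Flash time with a made-up 15 s advantage

diffs = [f - h for f, h in zip(flash, html)]    # Flash minus HTML for each user

# Between-subjects style: pretend the two columns came from separate groups of n users each.
se_between = math.sqrt(statistics.variance(html) / n + statistics.variance(flash) / n)
# Within-subjects: a single column of difference scores from n users.
se_within = statistics.stdev(diffs) / math.sqrt(n)

print(f"between-subjects se: {se_between:.2f} s")
print(f"within-subjects se:  {se_within:.2f} s")
```

The user-to-user spread drops out of the difference scores, so the within-subjects standard error comes out much smaller even though the between-subjects comparison here assumes twice as many users. How big the gap is in practice depends on how consistent your users are across the two designs.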
Using Within-subject Testing
Within-subject testing is great for increasing decision potential, but it has a couple disadvantages. First of all you need more time per user, so you probably have to compensate them more. However, that’s usually cheaper than having to recruit more users, given the overhead of identifying suitable ones. As long as it doesn’t make the usability test so long no one wants to do it, that isn’t a major problem. Usually, you’ll get more decision potential from a within-subject test of a certain number of users than you would with a between-subjects test of twice as many users. Both approaches have the same total amount of user time, so the compensation cost is about the same, but within-subject gives you more decision potential for the compensation buck. Thus, even if the only cost were paying users for their time, you’re still usually better off with a within-subjects test.
A more serious problem is order effects. There might be differences in performance merely due to users trying one prototype before the other. For example if the users try the HTML version first then the Flash version, they might be faster with Flash simply because they got some practice painting their board, not because Flash is better. Or if the HTML version were faster, then maybe users were getting fatigued or bored by the time they got to paint their board (again) with the Flash version. The solution is to balance out the order effects by randomly assigning exactly half of your users to try the Flash version first while assigning the other half to try the HTML version first. This is called cross-over within-subject testing.
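One simple way to do the random half-and-half assignment is to shuffle your list of users and split it down the middle. The sketch below is just that; assign_orders is a hypothetical helper name, and the user IDs 1 through 20 are placeholders for however you track your recruits.

```python
import random

def assign_orders(user_ids, seed=None):
    """Randomly give half the users Flash first and the rest HTML first."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for i, uid in enumerate(shuffled):
        # First half of the shuffled list starts with Flash, the rest with HTML.
        orders[uid] = ("Flash", "HTML") if i < half else ("HTML", "Flash")
    return orders

orders = assign_orders(range(1, 21), seed=1)   # 20 placeholder user IDs
flash_first = sum(1 for order in orders.values() if order[0] == "Flash")
print(f"Flash first: {flash_first}, HTML first: {len(orders) - flash_first}")
for uid in sorted(orders):
    print(uid, " then ".join(orders[uid]))
```

With an odd number of users the extra one lands in the HTML-first group, so the split is as even as it can be but not exact.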
Within-subject testing should be considered for all usability tests regardless of the measure. We could’ve done it for the test comparing ratings of the new and old home page (now I tell you). That not only would’ve lowered the standard error, but also would’ve encouraged users to make contrasting ratings for the two versions if they believed one was better than the other. As always, you want to balance out the order effects by having exactly half your users seeing one version first while the other half sees the other first. Who knows? There may be primacy preference effects where users tend to like the first thing they see the most. Or maybe recency effects where they assume the second thing they see must be “improved.” A cross-over test eliminates such concerns, pre-empting any tedious (but legitimate) arguments over data interpretation.
The only time to not use within-subjects is when you’re likely to get extreme order effects -when doing the task the first time effectively negates the meaning of doing the task again. For example, say you’re testing two menu designs to see how they communicate a web site’s information architecture. Users are going to at least partially learn the IA with the first menu design they use, so they’ll be much better when they try the second one. In fact there may not be much new learning of the IA at all so it’s no longer reasonable to compare the two.
Finally we get to run the usability test. Here’s your data:
Two of your users were “No Shows” who didn’t arrive for the usability test, so you have 18 rather than 20 users. With a within-subjects test, we’re only interested in the statistics from the difference of the completion times (in seconds) for each user. In this example, we’ll arbitrarily make each user’s difference be the Flash design completion time minus the HTML completion time, so a negative number means that a user was faster with Flash and a positive number means a user was faster with HTML (it won’t make a difference if it were HTML minus Flash as long as you keep track of what positive and negative mean).
With a negative average of the difference scores, we see that users are on average faster with Flash, but only by 13.7 seconds. With within-subjects testing the standard error is still derived from the sample standard deviation, but purely from the standard deviation of the difference scores. The formula for the standard error in within-subjects testing is:
se = sd / SQRT(n)
Where sd is the standard deviation of the differences and n is the number of users (or number of difference scores). With the data above:
se = 80.3 / SQRT(18) = 18.94
Without even doing the t-tests we have a sense where things are going: The average difference of -13.7 seconds is well within one standard error of 0 (our Hypothetical State A), so our data is consistent with no difference between Flash and HTML. We should expect it to be pretty plausible that we get -13.7 seconds assuming Hypothetical State A is true. But we’ll still do the math.
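The same back-of-the-envelope check can be sketched in a few lines of Python, using only the summary statistics quoted above (the average difference, its standard deviation, and the number of users). Because those inputs are rounded, the last digit may differ slightly from a spreadsheet working on the raw data.

```python
import math

mean_diff = -13.7   # average Flash-minus-HTML difference in seconds (from the text)
sd_diff = 80.3      # standard deviation of the difference scores (from the text)
n = 18              # number of users, one difference score each

se = sd_diff / math.sqrt(n)        # standard error of the average difference
distance = (mean_diff - 0) / se    # deviation from Hypothetical State A in standard errors

print(f"se = {se:.2f} s")
print(f"observed difference sits {distance:.2f} standard errors from zero")
```

A distance of less than one standard error is the "well within" the paragraph above is talking about.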
As always, we also take a look at the distribution of the sample. It’s more-or-less bell-shaped, but it’s lopsided. Looking at the completion times for either HTML or Flash, we see most numbers are bunched up at the relatively low end while there’re relatively few at the high end. Most of your numbers are below average, which sounds like Lake Wobegon in some dark parallel universe, but it’s perfectly possible mathematically, and happens to be true in this case. The difference scores are not so badly lopsided, but I wouldn’t trust the distribution you get from a sample of difference scores.
This is called skewed data, and it’s common for the times to complete an action. It makes sense: there’s a limit to how fast a user can go -they certainly can’t complete a task in less than zero seconds, but there is no limit to how slow they can go. The effect on the differences in completion times, which is what we really care about, is not entirely predictable. Data may end up lopsided one way or the other, or not at all.
You can get a sense of skewness by just eyeballing how the data is distributed around the average, but there are also formulas that quantify skewness, which can be easier especially with large data sets. Using Excel’s SKEW() function, our data has a skewness of 1.11 overall, which is pretty high. A zero would be a perfectly symmetrical distribution, and negative skew would mean skewness in the opposite direction, with sparse data stretched out below the average. Generally, skewness between -1 and 1 is mild.
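If you're not in Excel, skewness is easy enough to compute by hand. The sketch below uses the adjusted sample-skewness formula that Excel documents for SKEW(); sample_skewness is my own name for it, and the two lists are made-up numbers chosen only to show a symmetrical case and a lopsided one. They are not the data from this test.

```python
import math
import statistics

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / sd) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three data points")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation (divides by n - 1)
    cubes = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes

symmetric = [100, 200, 300, 400, 500]            # made-up: evenly spread times
right_skewed = [100, 110, 120, 130, 150, 400]    # made-up: one slow straggler

print(f"symmetric:    {sample_skewness(symmetric):.2f}")
print(f"right-skewed: {sample_skewness(right_skewed):.2f}")
```

The straggler in the second list is the typical completion-time story: one user takes forever and drags the skewness well above 1.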
But that’s not a big deal, is it? After all, the t-test only requires that the sampling distribution be normal, not the data distribution. The statistical high priests tell us that the sampling distribution will tend to be normal. Specifically, it will be close enough to normal if the sample sizes are large enough.
We’ve a pretty decent sample size -eighteen -so is that large enough?
Well, it depends on the shape of the distribution of data. In particular, skewness is the killer, where the greater the skewness in the data, the larger the sample has to be, blah, blah, but the short answer is no.
No. Seriously. Don’t blindly do a t-test on completion times or reaction times from a usability test. The data is generally too skewed and the samples are generally too small to get an accurate answer from a t-test. You want more like 30 data points per design before you can stop worrying about the skewness commonly seen in completion times.
But I wasn’t just wasting your time. You can do a t-test on completion times or reaction times, but you must first transform the times. Yes, Igor! Transform my data! BWAHAHAHAHAHA!
Okay, what the hell is transforming data? It simply means mathematically manipulating each data point in the same way. In our case, we want to manipulate the data so that the distribution has little or no skewness which is what’s keeping us from doing a t-test.
The Log Transform
The transformation that usually works for reaction and completion times is to take the logarithm of each time. Base 10 or natural logarithm, it doesn’t matter, but, out of habit, I use the natural logarithm (the LN() function in Excel). Here’s our natural logarithm transformed data:
The log transformation pulls in the right-hand tail of a distribution, making positively skewed distributions more symmetrical. Note the skewness is pretty much gone, both from the times for each version and the difference in the times. We have a nearly equal number of data points on each side of the mean. The difference between the averages is now -0.063 instead of -13.7, but of course it’s still showing the Flash version is faster than the HTML version. But now we can do a t-test and figure the probability of getting a difference of -0.063 with a sample of eighteen users.
When to Transform
There are no hard and fast rules for when to transform or not. Completion times typically are skewed, but don’t have to be. Skewness tends to be low for long, multi-step routine tasks with trained users, as long as none leave the computer to catch some waves in the middle of the task. Furthermore, what we really care about is the skewness of the differences in the times, which may or may not be skewed even when the source times data are skewed. However, the skewness you see is itself based on a small sample and can be off by some amount from the true amount of skewness in the population. Skewness can go all over the place for the differences in completion times, and what you see in the sample is not always the best indication of what’s in the population. I wouldn’t trust it.
As a rule of thumb for sample sizes typical in usability tests (i.e., 10 or fewer data points per design), you should generally assume completion times are skewed unless:
- There is very little positive skewness in the sample (say under 0.50), or,
- You have prior data or other information that indicates there is no skew.
For intermediate sample sizes of 10 to 30, I would suggest focusing on the skewness of the completion times, not the differences of the completion times. If the raw completion times are (close enough to) normal, then the differences in the completion times will be normal too. If doing the transformation reduces the skewness (makes it closer to zero) then it’s probably worth doing.
For over 30, you’re probably okay without any data transformation of completion time data (where skewness is generally less than 2). Let the Central Limit Theorem do its job.
The main reason not to do a transformation is that it changes a bit what you’re testing. Generally the average of the transformed data is not equal to the transformation of the average of the data. In our example, the average of the log-transformed differences, -0.063, is not the logarithm of the -13.7 second average difference (the logarithm of a negative number isn’t even defined). A t-test on transformed data means you are no longer comparing arithmetic averages. Rather, you are comparing some other indication of the central tendency. That can make it harder to explain your results to your client.
Fortunately, in the case of the log transformation, you are effectively comparing the geometric averages. While the arithmetic average is all numbers added together and divided by n, the geometric average is all numbers multiplied together, with the nth root taken of the product. Sort of an average worthy of a multi-gigahertz processor. This is fortunate because, for completion times, it seems the geometric average is generally a better indication of the central tendency than the arithmetic average.
To get the geometric average from our averages of the transformed data, reverse the transformation of the averages:
EXP(average of LN-transformed data) = geometric average
EXP(5.66) = 287 seconds = geometric average of HTML
EXP(5.60) = 270 seconds = geometric average of Flash
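A short Python sketch can confirm that "average the logs, then EXP()" really is the geometric average. The function name geometric_mean_via_logs and the three toy times are mine, picked so the answer is easy to check by hand; the 5.66, 5.60, and -0.063 are the averages quoted in the text. The last line rests on a fact about logarithms rather than anything stated above: a difference of logs is the log of a ratio, so reversing the transformation on the average log difference gives the ratio of the two geometric averages.

```python
import math
import statistics

def geometric_mean_via_logs(values):
    """Average the natural logs, then undo the log with exp()."""
    return math.exp(statistics.mean([math.log(v) for v in values]))

toy_times = [100, 200, 400]                    # made-up times with an obvious answer
gm_logs = geometric_mean_via_logs(toy_times)
gm_root = (100 * 200 * 400) ** (1 / 3)         # nth root of the product, done directly

html_gm = math.exp(5.66)     # geometric average for HTML, in seconds
flash_gm = math.exp(5.60)    # geometric average for Flash, in seconds
ratio = math.exp(-0.063)     # Flash geometric average as a fraction of HTML's

print(f"toy data: {gm_logs:.1f} via logs, {gm_root:.1f} via nth root")
print(f"HTML {html_gm:.0f} s, Flash {flash_gm:.0f} s, ratio {ratio:.2f}")
```

A ratio around 0.94 says Flash's typical time is about 94% of HTML's, which is the kind of number you can compare directly against an 80% Hypothetical State B.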
The Arcsine Transformation
Rating scales, like I said, tend not to be skewed much, so transformations are usually not necessary. However, you can get skewness -indeed you should expect skewness -if you are getting any floor or ceiling effects, where scores tend to be bunched up at one or the other extreme of the scale. For example, if our rating scale from Stat 201 averaged about 30 for its possible range of 5 to 35, meaning on average users were giving 6’s on the 7-point items, I would expect skewness. It’s as if users were pushed up against the ceiling of 35 points. Some apparently felt that “Agree” and “Strongly Agree” were not sufficiently strong for their level of agreement. They needed “Totally Agree,” “Massively Agree” and “F-ing Epicly Agree” beyond it. If you get skewed ratings due to floor or ceiling effects, you can try the following transformation to reduce skewness:
- Convert the user’s scores to the proportion of total possible range of points, so that 0 corresponds to the lowest possible score and 1 corresponds to the highest possible score. For example, with our scale that goes from 5 to 35 points, a 30-point score becomes (30 – 5)/(35 – 5) or 0.833.
- Take the arc sine (in Excel the ASIN() function) of the proportion. So a proportion of 0.833 becomes 0.985 radians. (Excel by default gives arc sines in radians; it doesn’t matter if you use radians or degrees since they’re arbitrary units for a rating.)
The above transformation has the effect of reducing floor and ceiling effects, stretching what otherwise would be compressed differences at the extreme ends of the scale.
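The two steps above can be sketched in a few lines of Python. This is just an illustration of the recipe as written; the function name and the range check are my own additions.

```python
import math

def arcsine_transform(score, lowest, highest):
    """Rescale a rating to a 0..1 proportion of its range, then take the arc sine (radians)."""
    proportion = (score - lowest) / (highest - lowest)
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("score must lie between lowest and highest")
    return math.asin(proportion)

# The example from the text: a 30-point score on a 5-to-35 scale.
example = arcsine_transform(30, 5, 35)
print(round(example, 3))

# A one-point step near the ceiling versus a one-point step mid-scale.
gap_at_ceiling = arcsine_transform(35, 5, 35) - arcsine_transform(34, 5, 35)
gap_mid_scale = arcsine_transform(20, 5, 35) - arcsine_transform(19, 5, 35)
print(gap_at_ceiling > gap_mid_scale)
```

Applied directly to the proportion like this, ASIN stretches differences near the ceiling the most. A variant often used for proportions, ASIN(SQRT(proportion)), stretches both ends of the scale; the sketch sticks with the version described above.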
The t-test follows the same steps as we saw in Stat 201:
Step 1. Calculate your standard error. For the transformed data, it’s:
se = 0.290 / SQRT(18) = 0.0683
Step 2. Calculate your sample t-statistic, which is still the deviation of your observation from the hypothetical state in units of the standard error. We’re dealing with transformed data now, but that doesn’t change Hypothetical State A. If the real difference in the un-transformed data is zero, then the difference in the log-transformed data will also be zero, assuming the data for each design have the same distribution as well as the same average. So we’ll still calculate the probability of seeing the -0.063 difference in the transformed data assuming the real difference is 0.
So the sample t-statistic is:
tstat = (diff – hypo) / se
tstat = (-0.063 – 0) / 0.0683 = -0.92
We draw our sampling distribution, as always, putting most of the distribution within a standard error of the average:
Step 3 through 3 and a half. Get the p-value, working your way around Excel. For a within-subjects t-test (or any t-test on a single column of numbers like we’re doing here), the degrees of freedom is your sample size minus 1:
df = n – 1
df = 18 – 1 = 17
Since we’re doing a two-tailed test, always enter the positive tail into Excel so it won’t trip out and give you an error, and let Excel double it for you to include the other tail.
p = TDIST(ABS(-0.92), 17, 2) = 0.373
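For anyone working without Excel, the three steps for Hypothetical State A can be sketched in plain Python. This is a rough stand-in, not a reference implementation: `two_tailed_p` is my own helper that approximates what TDIST(ABS(t), df, 2) returns by numerically integrating the t density, and I am assuming the 0.290 in Step 1 is the standard deviation of the transformed differences.

```python
import math

def two_tailed_p(t_stat, df, steps=2000):
    """Approximate two-tailed p-value for Student's t, comparable to TDIST(ABS(t), df, 2).

    Integrates the t density from 0 to |t| with Simpson's rule (steps must be even).
    """
    t = abs(t_stat)
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central_area = total * h / 3  # probability between 0 and |t|
    return 1 - 2 * central_area

n = 18
sd = 0.290                 # assumed: standard deviation of the transformed differences
observed_diff = -0.063     # observed difference in the log-transformed data
hypothetical_diff = 0.0    # Hypothetical State A

se = sd / math.sqrt(n)                            # Step 1
t_stat = (observed_diff - hypothetical_diff) / se  # Step 2
df = n - 1
p_value = two_tailed_p(t_stat, df)                # Step 3
print(round(se, 4), round(t_stat, 2), round(p_value, 2))
```

The p-value should land close to the 0.373 above; small differences come from the rounded inputs.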
No surprise here: it looks like it’s reasonably plausible that we’d see a difference as extreme as 0.063 in a sample size of eighteen when there is in fact no difference in the population. That’s just as we suspected when we calculated the standard error for the untransformed data. Properly transforming is not going to magically insert more certainty into your data; it’s just going to shift the data around as a group to make the statistical test work more accurately.
At this stage, however, we don’t know the likelihood of seeing this difference when there is in fact a considerable difference in the population. Time to test Hypothetical State B.
Step 1. Calculate your standard error. That doesn’t change when we go to Hypothetical State B:
se = 0.0683
Step 2. Calculate your observed t-statistic. Now we have to be careful. For untransformed data, Hypothetical State B was that the faster design takes 80% of the time of the slower design; in other words, the following ratio is true:
faster/slower = 0.80
However, taking the log transform changes this equation because the ratio of two numbers is not equal to the ratio of the logarithms of the same two numbers:
faster/slower <> Ln(faster)/Ln(slower)
So it would be incorrect to get the probability of seeing a difference of -0.063 assuming the ratio of the transformed data is 0.80. We need to convert Hypothetical State B into its equivalent with log-transformed data. To do that, we take advantage of the following mathematical relation:
Ln(x) – Ln(y) = Ln(x/y)
Which means for Hypothetical State B to be true:
Ln(faster) – Ln(slower) = Ln(faster/slower) = Ln(0.80) = -0.223
In other words, Hypothetical State B states that the difference in the log-transformed completion times is -0.223. We’re doing a two-tailed test, so we want to know the probability of getting a difference as extreme as 0.063 (positive or negative) given a real difference of 0.223 (positive or negative respectively).
So our t-statistic for Hypothetical State B is:
tstat = (-0.063 – -0.223) / 0.0683 = 2.35
Be sure the sign of your hypothetical state (positive or negative) matches the sign of your observed difference to allow for the fact that this is a two-tailed test.
Here’s the sampling distribution:
Step 3 through 3 and a half. Get the p-value.
Just another two-tailed test; the degrees of freedom haven’t changed:
p = TDIST(ABS(2.35), 17, 2) = 0.031
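Here is the same Hypothetical State B calculation as a plain-Python sketch, including the conversion of the 80% ratio into a log difference and the sign-matching rule. As before, `two_tailed_p` is my own rough stand-in for TDIST(ABS(t), df, 2), and the 0.290 standard deviation is assumed to describe the transformed differences.

```python
import math

def two_tailed_p(t_stat, df, steps=2000):
    """Approximate two-tailed p-value for Student's t via Simpson's rule (steps must be even)."""
    t = abs(t_stat)
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1 - 2 * (total * h / 3)

n = 18
se = 0.290 / math.sqrt(n)   # unchanged from Hypothetical State A
observed_diff = -0.063

# Ln(faster) - Ln(slower) = Ln(faster/slower) = Ln(0.80)
log_ratio = math.log(0.80)

# Match the sign of the hypothetical difference to the observed difference.
hypothetical_diff = math.copysign(abs(log_ratio), observed_diff)

t_stat = (observed_diff - hypothetical_diff) / se
p_value = two_tailed_p(t_stat, n - 1)
print(round(log_ratio, 3), round(t_stat, 2), round(p_value, 3))
```

With the rounded inputs used here, the t-statistic and p-value may differ slightly in the last digit from the 2.35 and 0.031 above.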
So it is a pretty safe bet that there is no considerable difference between the two designs: no good reason to believe one is definitely worth favoring over the other.
When Two Designs are about the Same
The good news is that your efforts to minimize the standard error appear to have paid off. With a single test of 18 users you can assert with high confidence that neither design is considerably faster to use than the other. The bad news is that you still have to choose what goes in the final web site, and the completion times are not going to help you much in this regard. Collecting more completion time data is not going to help here. Unless you’ve been extraordinarily unlucky, neither Paint Your Board design takes less than 80% of the time of the other. Running more tests is just going to help you narrow down the range. Maybe one takes less than 90% or 95% of the time, which is all in the “whatever” zone, so why bother?
You need to make a design decision and if completion times won’t help, you need something else to tell if one is even slightly better than the other. There are several ways you and Stan can proceed.
Give Them Both?
You may be tempted to put both designs up on the web site and make it a user option. After all, you went through the trouble of prototyping both, and you’d hate to throw one away. Looking through your data, you see some users were faster with HTML and some were faster with Flash. Maybe it’s a matter of user personality or experience.
This is rarely a good design strategy. Essentially, you’re saying, “I, a UX professional, can’t tell which is better, despite rigorous scientific testing, so I’m going to make the user, who doesn’t know Flash from Bash, make the decision.” First of all, just because a user was faster with, say, Flash, in a particular usability test doesn’t mean that user is generally faster with Flash. The faster performance may be due to various situational factors such as fatigue, learning effects, boredom, distraction, or whatever. You would need to compare each user on multiple Flash and HTML implementations to tell if there are indeed natural-Flash-users and natural-HTML-users.
The more serious problem is this: how are users supposed to know if they are Flash or HTML users, if such creatures exist? What do you put on the website to help them decide? You could provide access to the Flash version through an “Advanced” link, which is a common solution to this sort of problem, but users don’t know you mean “advanced” in a particular Flashy way. They are liable to think “Advanced” gives them more Paint Your Board options and control, like Advanced Search. It might work to label the Flash version “Paint Your Board with Drag and Drop” if users who know how to drag and drop also know what “drag and drop” means. I wouldn’t assume so. Whatever label you use, you’ll have to test it and see if users with a choice paint their boards faster than users without a choice. And remember: the very fact that you give the users a choice means it’ll take more time: time to make the decision and time to correct themselves if they choose the wrong one for themselves at first. Adding choice means adding complexity. You should do it when it solves a well understood user problem, not a poorly understood designer problem.
Revisit Your Designs
Perhaps the most educational thing you can do is study your quantitative and qualitative data to try to figure out what happened. You expected Flash to have faster completion times but be harder to figure out in the first place. But how exactly did that play out? What part of the Flash instructions took too long to read or were too hard to understand? Maybe there are some simple improvements you can make that will put the Flash version on top.
Review your debriefing comments from your users. Maybe certain users do better with Flash while others do better with HTML because of specific design elements in each. Maybe there is a way to combine the best of HTML with Flash to make a superior design for all users. For example, maybe you can add input fields to the Flash design for those who do well with numeric input, but apply those inputs instantly, rather than through a Submit button, in order to reduce the time-consuming cut-and-try of the HTML design.
Unfortunately, in this case, both designs have already been through a set of iterations and testing since their first paper prototypes, and there isn’t time left to further tweak the design. Any further improvements will have to wait until Version 2.
Review Your Test Limitations
Take a second look at the limitations of your usability test. All usability tests are to some degree a simulation of actual user behavior. If you consider the differences that likely exist between the test conditions and real conditions, maybe you can see a way to break the tie between the two designs. In tightly controlling the task in order to maximize decision potential, you also limited the range of conditions under which users used the feature. What may be the same performance in your usability test may not be the same on average when you generalize the results to the wider range of conditions that the users will encounter.
In this example, all your users were unfamiliar with both designs for the Paint Your Board feature, which is pretty typical of usability tests. However, it is reasonable to expect that experienced users will do especially well with the Flash version since its main issue was learning it in the first place. A study of the qualitative data may find evidence to support this: the videos show users of Flash first struggling with drag and drop, but once they know what to do, they really take off. Breaking the task into two subtasks and running a couple more t-tests may confirm this.
If Flash is essentially tied with HTML for novice users but it looks like Flash will be faster for experienced users, then on average Flash is better for all users. Perhaps that’s enough to tip the balance. However, Stan tells you that he expects very few experienced users. Even surfers who keep multiple boards in their respective quivers are rarely going to order a custom board, maybe once a year or less for the serious mega-ripper. You don’t know if users will remember what they learned a year ago. It’s not a very strong argument for Flash.
Consider Other Data
You can look at other data to help break the tie. For example, if you had thought to also include rating scales in this usability test, maybe that would tell you which design the users preferred regardless of the completion times. Right. That would’ve been a good idea. Okay, but you have other data: serendipitous data you can extract from the videos, key-logging, and user outputs. For example, in Stan’s opinion the boards resulting from the Flash versions were more creative. Ten users painted their boards exactly the same with each design of Paint Your Board, but eight users painted their boards differently as they went from Flash to HTML or vice versa. Stan says that of the eight users that painted their boards differently, all but one user made a more creative board with Flash.
Hmm. If there were in fact no effect of the Paint Your Board feature design on creativity, then there would be a 50:50 chance of Flash boards being more creative. You turn to the binomial probabilities tables from Stat 101 and look down the 50% population rate column for a sample size of eight. It seems that the probability of 7 or more users getting a better board with Flash is 0.035 given a Hypothetical State of 50:50. We’re still in a two-tailed situation here, so double that probability to get 0.070 (congratulations: you just did a sign test). Statistically, it seems very likely that Flash does indeed result in more creative boards, at least in Stan’s opinion. Maybe users weren’t taking less time with the Flash version, but instead they were using the same amount of time to play with the Flash version to get something extra-rad. Maybe with the HTML version they were merely settling on what they could get in a comparable amount of time.
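If you don’t have the binomial tables handy, the sign test above is easy to reproduce. A minimal Python sketch, using only the counts from the text (8 users who painted differently, 7 of them more creative with Flash); the function name is my own:

```python
import math

def upper_tail_probability(successes, n, rate=0.5):
    """Probability of `successes` or more out of n, given the hypothetical population rate."""
    return sum(
        math.comb(n, k) * rate**k * (1 - rate) ** (n - k)
        for k in range(successes, n + 1)
    )

n_changed = 8        # users whose boards differed between designs
flash_better = 7     # of those, boards Stan judged more creative with Flash

one_tailed = upper_tail_probability(flash_better, n_changed)
two_tailed = 2 * one_tailed   # doubling is valid because a 50:50 binomial is symmetric
print(round(one_tailed, 3), round(two_tailed, 3))
```

The ten users who painted identical boards are left out, just as in the text, since they say nothing about direction.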
Sounds reasonable, and maybe that’s enough to favor Flash over HTML. Of course, it’s just based on Stan’s opinion, and maybe Stan was biased, subconsciously rating the Flash boards better because he feels Flash is so swick. To really do this right, you should bring in a bunch of other surfers and have them compare the boards without them knowing if the boards were painted with Flash or HTML. You’d have them rate the relative creativity on a scale so you’re capturing the magnitude of the difference for each user, not just the direction of the difference. This provides you with more information yielding greater decision potential. You’d analyze the data with a t-test on the average rating, with Hypothetical State A set to whatever rating is “no difference.” Yeah, that would be the right thing to do. Too bad there’s no time for more testing.
Consider Other Business Goals
If after considering all the data, there still does not appear to be a clear winner, consider each design’s impact on other UX dimensions than usability. The HTML version provides better accessibility for all those blind surfers out there (there are some) who also want to impress the sweet wahines (or kanes) with their sweet paint job. However, you could argue that just providing mechanical access to the Paint Your Board controls doesn’t really provide adequate accessibility anyway since the feedback is inherently visual. It seems the Flash version is cooler. Stan is personally certainly thrilled by it: this is precisely the sort of heavy technology he was looking for to distinguish his site’s UX from the competition. It’s perf that something as advanced as his CNC surfboard painter would have a coolaphonic UI. If the usability is really equal, that’s as good a reason as any other to go with Flash.
Whatever was Better
If the Flash and HTML version are really tied on completion time, maybe you should flip a coin to decide which design to go with.
No, don’t do that. The truth is they aren’t tied. Flash performed better in the sample. With 1 – 0.373 = 62.7% certainty you can say that Flash is better than HTML. That’s not a whole lot of confidence when a coin flip gives you 50% certainty of being right, but it’s nothing to ignore either. While it’s plausible that both designs have equal completion times, it is more likely that Flash is better than HTML than vice versa. If all other considerations are equal, and you have to choose one design (and in fact you do), you should go with the one that performed better in the sample. That may not be a great bet, but it’s better than the 50:50 bet a coin toss would get you.
Of course, going with the lowest average completion time is exactly what you would’ve done if you didn’t bother with the t-test or any inferential statistics, so you may be wondering what the point of it all was. The point was to make an informed decision. There’s a difference between being statistically mushed, where you are uncertain if one design is considerably better than the other, and this situation, where you are confident that there is no considerable difference between the designs on completion time.
So go ahead with Flash, but go ahead knowing that there really isn’t much difference in completion time worth worrying about. Go ahead, recognizing you may want to revisit the feature later and see if you can improve it. Go ahead when your decision is bolstered by consideration of other business goals, other data, and/or awareness of the limitations of the usability test. In this case, no consideration or statistic by itself is a strong argument for the Flash design, but as you discuss the issues with Stan, it’s clear that in aggregate, going with Flash makes the most sense. After all, a single usability test statistic, however definitive, is just one element of everything that goes into a design decision.
There’s another reason to do the inferential statistics. The t-test for Hypothetical State B led you to conclude that Flash is not considerably better than HTML on completion time. However, remember it was a two-tailed test: you’re justified in also concluding that Flash is not considerably worse than HTML. So it was definitely worth doing the t-tests to give you and Stan some peace of mind. With high certainty, going with Flash won’t be a tragic decision.
Summary Flow Chart
Problem: Comparing averages to determine if a certain design is truly better than another.
Solution: Follow the flow chart below.
- Determine if a one- or two-tailed test is right for you.
- With your client, set your Hypothetical States A and B.
- Check your data distribution for skewness, transforming it if necessary.
- Conduct a t-test for Hypothetical State A:
- Calculate your standard error.
- Calculate the t-statistic.
- Determine the p-value with something like Excel’s TDIST() function.
- If the p-value is low enough to represent a tolerable chance of a Type I error, proceed with the new design.
- If the p-value represents an excessive chance of a Type I error, conduct a t-test for Hypothetical State B.
- If the p-value represents a tolerable chance of a Type II error, do not proceed with the new design.
- If the p-value represents an excessive chance of a Type II error, calculate the increase in sample size you need to get a low p-value for either Hypothetical State A or B.
- Increase your sample size.
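The decision part of the flow chart can also be written down as a tiny function. This is only a sketch: the 0.10 thresholds for “tolerable” are placeholders I made up, since what counts as tolerable is something you set with your client, and the function name and return strings are my own.

```python
def next_step(p_state_a, p_state_b=None, max_type1=0.10, max_type2=0.10):
    """Walk the flow chart given the p-values for Hypothetical States A and B.

    max_type1 and max_type2 are hypothetical thresholds, not values from this post.
    """
    if p_state_a <= max_type1:
        return "proceed with the new design"
    if p_state_b is None:
        return "conduct a t-test for Hypothetical State B"
    if p_state_b <= max_type2:
        return "do not proceed with the new design"
    return "increase your sample size"

# The completion-time results from this post: 0.373 for State A, 0.031 for State B.
print(next_step(0.373, 0.031))
```

With two equally finished designs, “do not proceed” translates to “neither design is considerably better on this measure,” which is where the other tie-breakers discussed above come in.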
Here’s an Excel sheet with this post’s data and analysis.