Starting my current job 15 years ago, I opened my new box of business cards and read: “Michael Zuschlag, Engineering Psychologist.” I’m an engineering psychologist? I had no idea. If I were to choose the title for myself, I probably would’ve gone with “Human Factors Engineer.” After all, I got the job by dropping off my resume in the job center at the Human Factors and Ergonomics Society annual meeting. In prior jobs, working on telecommunications and business enterprise software user interfaces, I might’ve accepted being called a usability engineer, but even then I preferred human factors engineer. Usability engineering wasn’t so much something I *was* as something I *did*.

So, “engineering psychologist.” What the hell is that? I suspected it was something made up by some government worker toiling deep in the National Bureau of Occupational Classification, Office of Idiosyncratic Nomenclature. But it turns out to be a well-accepted name, and the more I thought about it, the more I liked it. Engineering. Psychology. Engineering concerns the design and analysis of products to achieve a certain level of performance. Engineers *invent* things. Psychology is the scientific study of behavior and mental experiences. Psychologists *discover* things. Hey, that’s what I want to do! I don’t want to be just a human factors *engineer*, and only make new things, and I don’t want to be just an academic psychologist (a prior career incarnation), and only find new things. I want to be a designer and a researcher, both a builder and an explorer. Span the whole research and development life cycle. Be holistic. Who wouldn’t want that?

Well, it would seem, lots of people in the user experience community, because that’s the way they’re heading, dividing and subdividing the field into smaller sub-specializations. Or maybe it’s unintentional, collateral damage from rampant title proliferation. We have:

- Human Computer Interaction
- Usability Engineering
- User Centered Design
- User Experience
- Visual Design
- Information Architecture
- Interaction Design

And those are just the more commonly used ones. I’ve also heard activity centered design, performance centered design, personnel subsystem design, information design, user assistance design, customer experience, contextual design, user experience architecture, and interaction architecture.

After searching around the web, including making prodigious use of Wikipedia (which is nothing if not the consensus of the interested community on the meaning of things), I found the definitions straightforward enough.

- *Human-computer interaction* (or HCI, but sometimes CHI, just to have some variety) is a field concerning “the design, evaluation, and implementation of interactive computing systems for human use.”
- *Usability engineering* (or UE) also concerns the design of human-computer interfaces but specifically seeks to maximize the user’s performance on dimensions such as learning, efficiency, and accuracy. So it provides the goal for user interface design.
- *User-centered design* (or UCD, but sometimes “human-centered design,” to specifically exclude potential orangutan users) is the process of designing a user interface through iterations of collecting user data (especially by observing user interactions) and creating designs. So that’s the means to the goals enunciated by usability engineering.
- *User experience* (or UX) is all a user perceives about a product. That includes the usability the user experiences, but also any emotional and motivational states. Designing a UI for a positive user experience implies expanding the goals beyond the user effectiveness and efficiency of usability to include anything else that contributes to user satisfaction. UX also implies designing more than the product itself: anything else (e.g., the packaging, the technical support) that affects the experience of the product.
- *Visual design* is the application of visual details (e.g., color, line, shape, space, images, typography) to create a target user experience, using visuals to communicate both information and emotion. That is, it’s the medium of design. It’s synonymous with graphic design applied to interactive products.
- *Information architecture* (or IA) is the design of the “organization, labeling, navigation schemes and retrieval mechanisms… [and] structural design of an information space to facilitate task completion and intuitive access.” That is, it’s another medium of design. While visual design uses visual details to create a targeted user experience, IA uses organization and labeling to create a targeted user experience.
- *Interaction design* (or IxD) is the design of “digital products for people’s use,” which is pretty much the same as HCI, expanded beyond computers per se to include anything using digital electronics.

Obviously, there is a lot of overlap in the definitions and even more in practice. For example, while theoretically HCI can pursue any goal in the design of a UI (e.g., easy implementation), in practice, HCI work almost always seeks to improve usability, or some other part of UX. UCD is sometimes used to mean the philosophy that design should conform to the capabilities and tendencies of the user in order to achieve the best user performance, making it virtually synonymous with UE. Furthermore, while UE and IxD use a variety of design methods and tools (e.g., cognitive walk-throughs), iterative design with observation of users has the central role, making UE and IxD nearly synonymous with UCD. IxD includes communicating with the user, so it overlaps with visual design. UX by definition includes all experiences of the product, which has profound implications for those working on the emotional and motivational aspects of user experience. However, if the questions on UX Stack Exchange are any indication, the dominant concern remains the usability of the product (i.e., making its use clear, easy, and accurate). Put a link on a web page, and you’re simultaneously doing visual design (the color and font of the link), information architecture (specifying what it links to and what its label should be), and interaction design (instantiating it as a link -an interactive control that responds to the user).

Practically the greatest differences among these names are in their connotations. HCI implies producing papers for scientific journals, rather than products for the marketplace, like the other names imply. Practitioners of visual design, IA, and IxD tend to be associated with different academic departments (visual arts, library science, and psychology, respectively). UCD tends to emphasize design over testing and analysis and UE vice versa. Also, as Whitney Quesenbery observed, HCI, UCD, and UE also mean you’re old, compared to those who align themselves with the UX or IxD label. (Mental note: add more gray to my beard in the illustration at the top of this blog.)

Personally, I don’t have a problem with any of that. It’s a similar situation with human factors engineering, human performance engineering, engineering psychology, and ergonomics. A job candidate could use any of those labels for his or her knowledge and experience, and I’d see them as pretty much equal; I’d look for more details in their resume to distinguish them. Human factors engineers may graduate from psychology or engineering departments. No one cares which it is.

On the other hand, I have to wonder if UX needs so many overlapping labels, and worry that it sows confusion among ourselves and our clients, employers, and other laypeople. Even more worrying, certain people in the UX community feel a need to define and divide the above into *non*-overlapping subdisciplines, to inflate the subtle connotations into distinct formal dimensions.

Why so many names for largely the same discipline? The three different academic origins of visual design, IA, and IxD help explain why those three names exist, but don’t explain why we also have HCI, UE, and IxD, which all come from largely the same academic tradition. I see a couple of reasons for name proliferation:

- Keeping it Fresh.
- Expanding the Scope.

Maybe it’s the way the business world sees the digital age: if it’s something you heard about 18 months ago, then it must be obsolete. In the fast-paced high-tech world, you need to re-invent your title periodically or else you seem outmoded and irrelevant, like ASP programmers. To move your career forward, you re-brand yourself with the latest buzzwords, like “content strategist,” rather than tired old “editor-in-chief.” At one government agency, I learned that information technology (which used to be called data processing) isn’t “information technology” any more. Now it’s “knowledge management.” Except, now they’re changing *that* to something new, but I can’t remember what. Maybe “solution innovation” or “innovating solutions” or something. Civil engineers don’t have to put up with this. The name of their discipline has remained unchanged for over two centuries, even as it has come to sound a little weird (they’re called “civil,” not because they’re necessarily polite and inoffensive, but because they designed and analyzed civilian works, as opposed to military works, which was once the only other kind of engineering there was).

The tech world’s requirement for periodic re-branding surely applies to us, but even more so. Branding itself is the latest business craze, which UXers have embraced, so if we’re going to build credibility for our discipline with the suits outside the UX community, we better be supremely branding ourselves to show that we get it. We gotta have the coolest names.

“Visual design” is cooler than “graphic design.” I mean, you’re designing *visually*, man, not graphically, the latter sounding like you forgot your pants that morning. More seriously, “graphic” implies physically printed matter, and that’s so twentieth century. I mean, no one uses paper anymore. “Architect,” as in “information architect,” is supremely cool. That carries prestige, so much so that early on there was a push for “interaction architecture” rather than “interaction design.” “Design” won out, maybe because it felt newer and more mystical. Real architecture comes down to bricks and mortar, and we all know that’s precisely what the web is driving to extinction. Design, in contrast, is virtual. Like The Cloud, it doesn’t really exist, so it must be the future.

We also have to have the coolest abbreviations, on par with AJAX and .Net. Bonus points for mixed-case names for that extra techie edge. It’s widely accepted that “X” is the coolest letter in the alphabet (arguably; I passionately believe “Z” is clearly cooler). From this perspective, the most significant way UX and IxD differ from UE and HCI is that the former have an “X” in their respective abbreviations.

Well, hey, if rebranding works, why not do it? After all, UX needed all the credibility it could get to go up against the entrenched ignore-the-user attitudes that prevailed in the business world. That would be fine as long as we don’t take our own rebranding too seriously, and recognize that the name-of-the-month is for our audience in the business world, not for us. But it seems some do take it seriously, and spend considerable effort picking apart the names, confusing themselves. It can divide us, pitting us against each other. Oh, man, you’re not still doing user-centered design, are you? *I’m* doing interaction design. Like the latest detergent, it’s new and improved. *I* should get the consultant contract, not you. What’s new and improved about it? New and improved packaging.

Confusion and conflicts like that end up weakening us, wasting our resources and undermining the unified voice we need for credibility. Sometimes, these conflicts end up looking kind of pathetic, such as a few years ago when a couple of venerable gurus, in what looked like an attempt to re-assert their relevance, tried to rebrand UCD to convince everyone they had a new breakthrough design method. Not that there isn’t progress in UX, but progress in any field is usually gradual. We don’t get a Newton or Einstein every year. It’s counter-productive to come up with new names like activity-centered design or research-informed design (or whatever Spool calls it) just because some people are doing UCD badly. If UXers are not paying enough attention to the task, tell them to pay more attention to the task. If they’re being too dogmatic, tell them to stop being so dogmatic. Don’t pretend you’ve invented a new way of doing design that deserves its own name.

At this point in our short history, I suspect re-branding is reaching the point of diminishing net returns. Pretty soon even the pointiest hair of the pointy-haired bosses will pick up that it’s a gimmick. Maybe it’s time we stopped.

We know we got a good thing. Our methods and tools work. We’ve measured dramatic improvements in user interfaces after doing our job. But “user interface” implies computers -that’s what users use (either that or recreational drugs). Surely our methods and tools can work wonders for other forms of technology too. We shouldn’t be limited to user-centered web sites and software applications. The public also needs cook-centered stoves, photographer-centered cameras, nurse-centered medical apparatus, shooter-centered guns, pilot-centered airplanes, audience-centered entertainment systems, and showerer-centered showers. It’s natural that we’d seek to expand the scope of our discipline and improve the usability of devices beyond general purpose computers of various form factors.

Oh, wait. That’s already been done. That’s engineering psychology / human factors engineering. Usability engineering is a sub-discipline of it.

Ah, but what about *user experience*? By including emotional design and addressing user motivation, it overlaps with human factors, rather than falling entirely within it. If we can expand the “user” in user experience to include the operator of any technology, then human factors becomes a subdiscipline of *UX*. What a coup!

The noble effort to expand UX to everything is apparently the second driver of name proliferation in UX. To do that we need to shed any specific association with computers inherent in names like HCI, user-centered design, and usability engineering, and we certainly don’t want to be associated with physical printing with a name like graphic design. So we came up with the names visual design, information architecture, and interaction design, all technology-neutral names. They apply to anything visible, informative, and interactive, respectively. Like “experience” in UX, they raise the level of abstraction, shifting focus from the specific artifacts to be designed, like a computer-user interface, to the psychological -what people sense, feel, know, and do with respect to unspecified artifacts.

Take information architecture, for example. I get the feeling that at the start of the information revolution in 1994, there was excitement about all the new ways of organizing information that would become available by shaking off the shackles of physical space with a hyperlinked network. However, in the end all we really use are linear and hierarchical structures (or both at once), two forms of organization going back to ancient times. Maybe that’s a human limitation. Maybe we are hardwired to think in sequences and categories (and sub-categories), and anything else becomes too confusing for the typical user. Consider how much training and experience users need to effectively query multiple tables in a relational database (an information structure that is curiously ignored by information architecture). It’s not natural for us.

Do we need a new name for doing in HTML the same thing we’ve been doing for hundreds of years with things like book chapters and shelves, card catalogues, and the Dewey decimal system? IA seeks to apply the “principles of design and architecture to the digital landscape.” But while physical space is defined by three continuous dimensions, hyper-linked information is defined by one discrete dimension: the number of links. Merely linking web pages together is Little IA. How do we expand it to Big IA? Well, if you take “architecture” to mean “the designed form of something,” then you expand the scope of information architecture to encompass all ways of representing information, and not just on web pages. That implies all sorts of things: IA may include how information is arranged on the screen, how graphics are used to convey structure and information, and what displays and controls are used to represent what information, including information acquired from the user.

Unfortunately, visual and interaction designers were also thinking big. In the process of expanding their scopes, they collided with the information architects. For example, successfully retrieving information clearly depends on the information architecture, but the act of retrieving is clearly interactive and therefore legitimate turf for interaction designers to claim. The information hierarchy on a page is marked by the graphic use of space, color, line, shape, and font, implying it’s visual design, not information architecture.

Well, we can’t have this. We have to have clearly defined and divided roles, practices, and even job positions.

The problem is there isn’t that much to web site design, which is where UX is currently concentrated. You probably make a UI that users use to access informative or entertaining content. For the vast majority of projects, the only information of consequence is text, the only interaction is clicking links, and the only visuals are the template for the page. Pictures typically provide secondary illustration of the text content, and occasionally there’s audio and video, but mostly text is the medium for the content. In theory, IxD involves designing how users accomplish many operations with content: create, modify, delete, and transfer it. If you’re lucky enough to work on bespoke web apps for industrial use, then you may get to make a sophisticated UI for a database. But if you work on consumer web sites, interaction is mostly navigation -clicking links. Occasionally, you take a little content from the user, maybe a paragraph of text or the choice of a picture or a delivery address and credit card number. Occasionally, the user does something other than consume content, such as clicking a Purchase or Share button. Contrast that to applications like word processing, where navigation is a small part of the interaction (confined largely to the Open dialog), and complex content manipulation is the bulk of the interactions. In web sites, there’s just not much IxD that isn’t also IA.

Try to divide web work up among the subdisciplines, and each gets so little. You have absurdly limited roles for each discipline, such as information architecture only dividing text into pages, interaction design only laying out the pages, visual design only determining the emotional “feel” of the page, and usability engineering only doing usability testing. These limited roles are not consistent with how these disciplines have been practiced. In the effort to expand UX to encompass more, we each end up getting less.

Conflict and division are not productive, but the feeling seems to be that this is a temporary state. We’re expanding UX beyond computer UIs, after all, which will accentuate our differences. Not everything is both informative and interactive, for example. A lot of things are one or the other. Soon there will be plenty of work for everyone. Above the interaction designers, visual designers, information architects, and usability engineers will be UX managers, strategizing and coordinating the whole experience of a corporation -all the touch points with customers: all the communications, marketing, services, products, and support. The customer experience with a company determines if the customer buys from the company, so UX is central to company survival and growth. We UXers should be in charge of running companies. And we will be.

Right.

I’ve been hearing how UX will soon expand throughout the business world for ten years now. While some are impressed with the progress we’ve made, I don’t see much progress on that front. The truth is we are software user interface designers. We have been successful in convincing most businesses they need UX for their web sites and apps, and that’s a substantial accomplishment that has ensured our long-term viability. But expand beyond the UI? No. There are no UXers designing theater, dining, weddings, retail outlets, packaging, and customer service experiences. Very few of us made the short step to digital hardware or embedded software design. Over 90% of the questions on UX Stack Exchange concern UI design. Even saying “we’re UI designers” is being generous if you’re limited to doing consumer web sites or web apps. Our only expansion in the past decade has been to add mobile, which mostly means web sites and apps for a different form factor and technology. Our aspirations have exceeded our acquisitions. In all my non-web-related work, I have never encountered anyone with titles like visual designer, information architect, or interaction designer. Despite the ambitions of these disciplines, you don’t see them in the making of airplanes, cars, ships, or trains, or even the computer-user interfaces to these technologies.

Every discipline thinks it’s the linchpin, the solution to all problems, and the deeper you go into a discipline, the more likely you are to feel that way, because the deeper you go, the more abstract it becomes, and the more you see connections to and inclusions of other fields. I bet the same thing happens in non-UX disciplines: smart economists, geographers, anthropologists, biologists, physicists, and medical doctors probably all think they have The Solution to things. In the business world, expert marketers think marketing should drive the whole company, and I bet finance experts think finance should drive it, and likewise for experts in accounting, project management, engineering, personnel, and probably even the lawyers.

Frankly, everyone probably can make a pretty good argument for why they’re so important. The result is that no one is going to get sole control of the business world. UXers are not going to be running corporations, at least no more often than anyone from any other business-related discipline.

The best-case scenario for UX expansion is that you convince the suits that UX covers all touch points. But if we succeed at that, we still lose. If everyone believes UX encompasses everything in addition to web sites, then people from other disciplines -the people already working on everything else -will commandeer the UX name. They will put the latest buzzword on their resume if we succeed in making it trendy. And soon, like so many buzzwords before, it’ll come to mean nothing. Meanwhile, we’ll still be doing only the web site.

More likely, others will redefine our names for us. We may think that a name like interaction design will give us leverage to expand beyond web sites. Words are powerful, as any politician knows, but ultimately their effect is transitory. Names are defined not by their components or etymology but by what people currently associate them with. “Ethnic cleansing” was originally coined by Serbian politicians because it made their policies sound good, but then the world saw what ethnic cleansing really was, and now no politician in the world will say they favor it. So, you can call yourself an information architect or interaction designer or whatever, but ultimately to everyone else, you’re the web guy.

The main reason that UX hasn’t expanded beyond UI is that there are already other disciplines controlling the territory we want to expand into. Human factors engineers and industrial designers got the hardware covered. Marketers got the customer experience, packaging, and retail. Theater professionals have theaters, restaurant management has restaurants, wedding planners have weddings, and so on. And you know what? The people in these disciplines are better at it than us. They’ve been doing it far longer and have all the specialized knowledge.

I don’t want to discourage UX from making forays beyond UI. I’m as biased as the next UXer, so naturally I think that UX has more to offer than nice web sites. However, it’s more realistic to aim to *cross* UX with other disciplines, rather than expand UX to encompass roles already covered by other disciplines. There’s a difference between cross-discipline exchange and taking something over. In the latter, you get conflict. In the former you get embraced. I was hired into a transportation human factors division partially because my UI design experience was regarded as relevant with vehicles becoming more computerized.

User experience and its subdisciplines are touted as vertical, being relevant across technologies, and I believe that’s true. But to work in any horizontal discipline requires unique knowledge specific to that discipline. We need to recognize that we have as much to learn from other fields as they have to learn from us. For instance, if you want to apply UX methods to cartography, go study cartography for a year or two, then bring over your UX skills. It’ll work well. But now you’re as much a cartographer as a UX designer. Similarly, a UX manager is basically a marketing manager who knows what goes into web design. Both management and marketing are disciplines of their own, so if you want to be a UX manager, get an MBA in marketing to complement your bachelor’s degree in UX/IxD/IA/whatever. You’ll succeed.

Along with acquiring the knowledge, crossing UX with a non-software field implies you also must leave some traditional UX knowledge behind. Take service design, for example. From what I can see, it’s not so much an expansion of UX from UI design, but a branch, merging with the market research branch of marketing. Service design took only the user research and task analysis skills from UX (i.e., the usability part, narrowly defined), and left behind most of the UX knowledge and skills. Knowing how to organize information, knowing which GUI control or pattern is for what, knowing how to make a 100-by-200 pixel image recognizable -none of that is relevant for general service design, unless you’re designing a software UI as part of the service.

I don’t bemoan the failure of UX to expand beyond software user interfaces. I don’t even bemoan that it’s concentrated in web design, although I would personally be bored if that’s all I did. The fact is the web is a whole lot. It’s a technology that has fundamentally changed our lives. In just two decades it’s become part of most people’s everyday interactions, and, with mobile, it’s becoming more ubiquitous.

We should be proud that we’re making a central contribution to that revolution, putting the awesome power of exabytes of information in everyone’s hands wherever they are. If a layperson asks what you do, don’t tell them, “I structure information to support findability,” don’t say “I shape interactive systems for use by people,” or “I manage the growth of complexity,” and certainly don’t say, “I design experiences.” Say, “I design web sites.” If they ask what that means, you can say something like, “Have you ever not been able to find something on a web site that you knew had to be there? I fix that.” Let’s stand up and clearly say what we are, rather than shroud ourselves in abstract language that will more likely confuse our listeners than impress them.

More formally, we can take a cue from our sibling vertical discipline, industrial design. Industrial design seeks to make physical products look good and work well for people. That’s what UX is, only substitute software for physical products. I’m not saying it’s easy to do UX. You need a comprehensive background in computer technologies, psychology, information management, and aesthetics to do it. You need to be a well-rounded professional who understands categorization, layout, behavior, graphics, testing, and analysis. What I’m saying is it’s easy to define. For the purpose of clear communication, we don’t really need names like usability engineer, interaction designer, and so on.

So, if UX is mostly web design now and in the future, where does that leave the subdisciplines? We’re back to a relatively small pie to divide up. I think the answer is that there is no need for IxD, IA, usability, and visual design to be separate subdisciplines, specializations, or job titles. I don’t see why a single person can’t have sufficient skills to do all the practices in UX. Saying your web design team needs an interaction designer, information architect, and visual designer is like saying it needs an HTML coder, CSS coder, and Javascript coder. How many college classes does it take to cover it all? Is there such a thing as Advanced Information Architecture or Usability Testing II? Part of me wonders if there is a natural division between functional design and aesthetic design, but maybe that’s only because my own academic background in aesthetic design is weak. Architects (real ones) and industrial designers don’t seem to have a problem covering both function and aesthetics.

I also don’t think it makes sense to regard IxD, IA, usability, and visual design as separate roles or practices, because even there they have too much overlap. For example, I wouldn’t define card-sorting as an information architecture practice. Although it’s a key technique for organizing content, you can also use it to make a menu of commands rather than web pages -isn’t that then practicing interaction design? Rather than dividing ourselves, maybe it makes more sense to divide the user interface, similar to how the UI is divided into HTML, CSS, and Javascript. Let’s treat IxD, IA, usability, and visual design as interrelating characteristics of the UI. A web site has an information architecture communicated by its visual design and accessed through its interaction design, achieving a certain level of usability. For example, you might say, “This icon has poor usability, compromising the information architecture. Let’s change the visual design to improve the interaction design.” Or you could say that users couldn’t find the right information because the icon was a poor label -users couldn’t tell what it was supposed to be a picture of. Sometimes it’s best if we ditch the jargon.

Recognizing the close relation among IxD, IA, usability, and visual design doesn’t resolve the problem of what to call yourself. I can sympathize that you don’t want to go with “web site designer” or “user interface designer” (if you do more than web sites). However accurate it is, I understand that it’s not a good career move given certain audiences.

Fine. Call yourself whatever you want to the suits. Change it every 18 months so you appear to have the latest business-critical skills. Save time, and make a Title Generator app that randomly combines the trendiest terms, coming up with things like Experience Engineer, or Knowledge Strategist, or Engagement Synergist.

Oh, but what should we call ourselves *to ourselves*? Well, if you don’t want to use “UI designer,” I’m okay with “user experience designer.” UX seems to have gained widespread use and understanding. Some seem hung up on the thought that you can’t design experiences, only artifacts (the same can be said about interaction and visual design). But, seriously, we know what we mean: “UX designer” is shorthand for someone who designs artifacts in an effort to achieve a specific user experience. The name is not going to lead to delusions of power. We already have those.

I’m also okay with user experience practitioner, user experience specialist, and, of course, simply UXer (“uckser”? “oosker”? “oozer”? “youkser”? Just not “uzer,” to avoid confusion). If you lean towards working with one characteristic of UI more than others, then go ahead and use one of the subdiscipline names consistent with its connotations.

I really don’t care.

On second thought, maybe I shouldn’t be so quick to dismiss the potential of UX to expand beyond software UIs. Let’s have a discipline name we can grow into, with latitude to expand, however improbable that is. Just because it hasn’t happened in 10 years, doesn’t mean it won’t ever happen. We should have a name that avoids “user” or other specific association with computers. We could go with simply “experience designers,” but the problem with that is “experience” isn’t the whole thing. That’s only sensations, perceptions, cognition, motivation, and emotion. To fully include interaction design, we need to cover human *behavior* in addition to human experience. And “designer” is also too specific. To include usability testing we also need to cover the empirical and analytic activities. The name should emphasize the centrality of the human mind in our work on products. It should pay homage to the scientific tradition from which our methods spring, but still indicate the practical applied nature of our work.

I’ve just the name: Engineering psychology.

**Potential solution**:

- Embrace the overlap among our subdisciplines.
- Don’t take efforts to be trendy too seriously.
- Expect that we will remain user interface designers.
- Look to cross with those working outside UI, not replace or subsume them.
- Celebrate that we make significant contributions to the information revolution.
- Use subdiscipline names to characterize the UI rather than each other.

- Calculating probabilities for two-category responses.
- Binary choice testing.
- A-B testing.
- Normal approximations for categorical data, and its limits.
- Binomial, Fisher’s Exact, Chi-square, and G tests.
- Multiple design (“multivariate”) testing.
- Pitfalls in on-line apps.

**Prerequisites**: Stat 101, Stat 201, Stat 202, license to operate a time machine.

Over at DIY themes, a simple A-B test led to a startling conclusion: removing “social proof” text at sign-up doubled conversions from 1% to a blistering 2%. Over 80 comments were exchanged debating how such a “sure-fire persuasion tactic” hurt web site performance. Few commenters were suspicious of the small “sample size” (the given numbers suggest 8 conversions out of 793 visitors for one site design, and 15 conversions out of 736 visitors for the other design). However, only Stephen came close to asking the right question: what is the p-value? Using a p-value calculator supplied by the service that did the A-B test, he got 0.051, which he declared not significant (i.e., *p* is more than 0.050).

Is that enough to ignore the results? If you read the first post in this series you know that “statistical significance” is not some magic border between reality and falsehood. For the purposes of making business decisions, the difference between 0.051 and 0.050 is barely anything. It’s equal to, well, the difference between 0.051 and 0.050. But that doesn’t mean I think we have evidence that social proof backfired in this case.

“How many?” It’s the simplest quantitative data you can imagine: the mere count or frequency of something. How many users fail to complete a task? How many conversions? How many prefer Design A over Design B? Simple as it is to measure counts (or related percents), it is among the most complicated forms of data to analyze, with several viable options each with their own strengths and limitations. It’s easier than you might expect to get it wrong.

But how complicated can counts be? After all, we’ve already dealt with counts in Stat 101. There, we calculated the p-values of two or more users having a problem with a site or app given a couple of hypothetical states. For example, we determined that, assuming a population failure rate of 10%, there is a 0.028 probability of 2 or more users failing out of a sample size of three. Based on that, we concluded that the actual population rate is over 10%. We did it all without doing any math.

As Tonto said to the Lone Ranger when they were surrounded by hostile Sioux, “What do you mean ‘we,’ pale face?” Of course, *I* did the math for you behind the scenes. What I actually did was, using the 10% hypothetical chance of a single failure, calculate *every possible* path to getting 2 out of 3 failures. I calculated the probability of the first two users failing but not the third, the probability of the last two users failing but not the first, and the probability of the first and third user failing but not the second. I summed all those probabilities to get the probability of 2 out of 3 failing. Then I did the same for 3 out of 3 failing (there’s only one path for getting that) and added *that* in to get the total probability of 2 *or more* failing for a sample size of three. To make the graphs and table for a sample size of three, I did the same for 3, 1 and no users failing out of three.

The probability of getting a certain count of events (e.g., failures) given the chance of a single event (failure) is a *binomial* probability. The “bi” in “binomial” in this case means “two possible outcomes” for each event, succeed or fail for example. That’s in contrast to other kinds of events, like a user rating something on a 35-point scale, which has 35 possible outcomes, or the time to complete a task, which has roughly a bjillion possible outcomes, depending on the precision of your stopwatch. For those situations, we already saw that you get p-values on the averages using a t-distribution as your sampling distribution.

For the number of users failing (versus succeeding), we use the *binomial distribution* as our sampling distribution, which comprises the probability of every possible result (counts of users) of every possible sample for a certain sample size and hypothetical state. Each column of each table in my tables of binomial probabilities from Stat 101 is a sampling distribution.

The process for calculating binomial probabilities to get the sampling distribution actually isn’t so complicated. We’re helped tremendously by assuming that the users perform *independently*: that the chance of the second user failing is unaffected by whether the first user failed, for example. Specifically, we assumed that each user has a 10% chance of failing (our Hypothetical State A) regardless of how every other user actually performs in each possible path. Such independence of events is reasonable to expect as long as you’re testing each user separately.

Independence simplifies life for the statistician. When you assume independence, you can get the probability of any path by multiplying together the probability of each event in that path. For example, the probability of the first two users failing but not the third is 0.10 * 0.10 * 0.90 = 0.009 (the last number is the probability of *not* failing). It also means the probability of the first two users failing is the same as the probability of the last two users failing is the same as the probability of the first and third user failing, thanks to the good ol’ commutative property that you learned in grade school. Once you know the probability of one path, you know the probability of all paths. So the way to actually calculate binomial probabilities is to calculate the probability of one of the possible paths, then multiply that by the number of possible paths. In this example, there are three different combinations of success and failure that give me two out of three, so the chance of getting precisely 2 out of 3 failures is 0.009 * 3 = 0.027.

Mathematically, it’s *p* = *P*^*f* * (1 – *P*)^(*n* – *f*) * *n*! / (*f*! * (*n* – *f*)!)

Where:

*n* = sample size

*f* = your number of observed events (failures)

*P* = your probability (not percent) in your hypothetical state

Where *f*! isn’t a loud and obnoxious f-er, but the “factorial” of *f*, which is:

*f*! = *f* * (*f* – 1) * (*f* – 2) * (*f* – 3) … all the way down until you get to 1.

Likewise for *n*!.

That’s the process for getting the probability of precisely *f* out of *n* events. To get the probability of *f* or more events, you also calculate the probability of *f* + 1 events, *f* + 2 events, *f* + 3 events up to and including *n*, then add all those probabilities together. In the case of two or more out of three, the only path remaining is the chance of getting three failures in a row, which is 0.001. Add that to the 0.027 we got for precisely 2-out-of-3, and you get the final answer of 0.028.
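The whole procedure above can be sketched in a few lines of Python (the function names are mine; `math.factorial` handles the factorials from the formula):

```python
from math import factorial

def binom_exact(f, n, P):
    """Probability of precisely f events out of n, each with probability P:
    P^f * (1-P)^(n-f) * n! / (f! * (n-f)!)"""
    paths = factorial(n) // (factorial(f) * factorial(n - f))  # number of orderings
    return P**f * (1 - P)**(n - f) * paths

def binom_or_more(f, n, P):
    """Probability of f *or more* events: sum the precise probabilities up to n."""
    return sum(binom_exact(k, n, P) for k in range(f, n + 1))

print(round(binom_exact(2, 3, 0.10), 3))   # 0.027: precisely 2 of 3 failing
print(round(binom_or_more(2, 3, 0.10), 3)) # 0.028: 2 or more of 3 failing
```

The same two calls reproduce every entry in the table below by varying *f*.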

Here’re all the binomial probabilities for a sample size of three with a hypothetical state of 10%. Just as you can sum the probabilities for each precise *f* to get the probability of *f*-or-more, you can also sum them the other direction to get the probability of *f*-or-less.

| *f* | p(precisely *f*) | p(*f* or less) | p(*f* or more) |
|---|---|---|---|
| 0 | 0.729 | 0.729 | 1.000 |
| 1 | 0.243 | 0.972 | 0.271 |
| 2 | 0.027 | 0.999 | 0.028 |
| 3 | 0.001 | 1.000 | 0.001 |

By now it’s clear that binomials are not so much complicated as boring. The probability of getting two or more out of three isn’t so bad, but if you have a sample size of 30 and you want to know the probability of getting 20 or more failures, that means getting the probability of precisely 20, 21, 22, 23, and so on to 30 failures, then adding them up. This is 2012. We don’t have that kind of attention span. So write a program or script to loop through the calculation of *f*-or-more failures (one of my earliest desktop programs, in Pascal, calculated binomial probabilities), or set up a spreadsheet that replicates the calculation in a cell for each count from *f* to *n*.

Or why bother to re-invent the wheel? Excel, for example, has a BINOMDIST() function. Enter the observed count of events (*f*), the sample size (*n*), the probability from the hypothetical state, and TRUE for “cumulative,” and Excel gives you the probability of *f* events or less. That is, it gives you the left tail of the distribution. To get the probability of *f* events or more, calculate *f* – 1 events or less and subtract that from one:

p(*f* events or more) = 1 – BINOMDIST(*f* – 1, *n*, *P*, TRUE)

For example, to get the probability of 2 or more failures out of three, calculate 1 minus the probability 1 or fewer failures.
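If you’d rather not fire up Excel, the same left-tail-and-subtract trick is easy to reproduce; here’s a sketch in Python (the function name mimics Excel’s cumulative BINOMDIST, but is my own):

```python
from math import comb

def binomdist(f, n, P):
    """Left tail, like Excel's BINOMDIST(f, n, P, TRUE): probability of f events or less."""
    return sum(comb(n, k) * P**k * (1 - P)**(n - k) for k in range(f + 1))

# Probability of 2 or more failures out of 3 at a 10% rate:
# calculate 1 minus the probability of 1 or fewer failures.
p = 1 - binomdist(2 - 1, 3, 0.10)
print(round(p, 3))  # 0.028
```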

This is reminiscent of what we did with the t-distribution when we wanted the p-value through the center of the distribution. A key difference is that the binomial distribution is discrete (the possible observed values are integers) while the t-distribution, like the normal distribution, is continuous, allowing any fraction. In a t-distribution the probability of *x* or more is essentially one minus the probability of *x* or less. But in a discrete distribution, the probability of *f* or more is one minus the probability of *f* – 1 or less.

If it’s too uncool to use an old-fashioned desktop application, you can search for a web or mobile app to calculate binomial probabilities. More on that in a minute.

Now you know where I get my Stat 101 numbers from. More practically, now you can calculate your own probabilities for values of *P* that I didn’t include in my tables (e.g., 16.7%, 3%, 0.01%).

But suppose you fall through a moth hole in the fabric of space-time to 1978, a time when mobile apps, iTunes, the web, and PCs either didn’t exist or weren’t routinely available. You immediately notice that Mick Jagger is already old. Almost as horrific, you next notice that people’s attention span really isn’t significantly different from today’s. Statisticians didn’t have the patience to calculate binomial probabilities by hand back then either. Instead, they worked out short-cuts.

Suppose you wanted to improve the user experience for your users in 1978, by… umm… providing blank punch cards in an assortment of colors. As your users sit in windowless institutional cinder block keypunch rooms clacking away on battleship gray keypunch machines, maybe a little more color will trickle some sunshine into their dreary little lives (the users, not the keypunch machines).

You float the idea to Sven, the Supreme Lord of the Mainframes and the original Bastard Operator from Hell (BOFH). Sven is delighted by the idea. In fact, Sven is willing to consider investing in the more expensive colored punch cards if you can show there’s *any* preference for them among the users. You’re surprised that a BOFH would lift a finger to help users, but this is the *original* BOFH -he’s *literally* a bastard operator from Hell. With a tear in his eye, he’ll proudly tell you how his unmarried mother worked days and nights in his tiny hometown of Hell, Norway so he could get an education and become the Supreme Lord of the Mainframes.

To test users for card preference, you count how many users choose colored versus plain punch cards. To keep your users independent, you have to observe at a time and place when only one user at a time goes to get punch cards so that they don’t influence each other. Also, if you see the same user twice, you only count his or her choice for the first time in order to maintain independent observations (yes, his or *her* in 1978; just ask my sister-in-law). You also have to switch the shelf positions of the punch cards between each user to balance out any possible position effects (maybe users naturally tend to grab from the stack of cards on the left or right). After doing this for 40 users, it’s getting monotonous, but the data are very compelling: 29 out of 40 users took colored punch cards, almost 3 out of 4.

So let’s set Hypothetical State A to represent No Preference, which would mean the population chance of choosing colored over plain is 50%. Using your binomial app or BINOMDIST() function, you see that the probability of getting 29 or more users out of 40 for a 50% population rate is 1 – BINOMDIST(28,40,0.5,TRUE) = 0.0032.

Looks like you have a solid statistical case for users preferring colored punch cards.

Oh, right, you can’t get binomial probabilities so easily because it’s 1978 and your smart phone’s battery is dead from too many games of Angry Birds while you were waiting for users to come get their punch cards. This is when you appreciate the shortcuts pioneered by yesterday’s statisticians.

Take a look at the binomial distribution for 40 with a 50% hypothetical population rate. This graph gives the probability of observing each *f* precisely (i.e., BINOMDIST(*f*,40,0.5,FALSE)), so to get *f*-or-more, you’d add all the bars from *f* to the right tail of the distribution.

Looks like a normal distribution, doesn’t it? In fact, thanks to the Central Limit Theorem, binomial distributions tend to be normal distributions, much like sampling distributions of averages tend to be normal distributions. Any binomial distribution will tend to be normally distributed with:

population average = *P* * *n*

standard error = SQRT( *n* * *P* * (1- *P*) )

We can use the normal distribution much like the t-distribution by converting our observed statistic (29, the count of users who chose colored punch cards) into units of standard error of the normal distribution. The new converted statistic is called “z”. I tell statistics groupies it’s named after me. Really impresses them.

z = (*f* – *hypo*) / *se*

Where

*f* = the observed count, 29 in this case.

*hypo* = the hypothetical state for counts (i.e., *P* * *n*, or 20 in this case).

*se* = the standard error from above, SQRT( *n* * *P* * (1 – *P*) ).

It works just like the t-statistic except that we have the actual standard error, not an estimate based on the standard deviations in our samples. That means there are no degrees of freedom to worry about, and you use the standard normal distribution, rather than the t-distribution adjustment for using an estimated standard error. In Excel, that’s the NORMSDIST() function, which gives the one-tailed probability of *z* or less, so you have to subtract it from 1 to get *z* or more.

Applying the normal approximation to our punch card data, we get:

*z* = (29 – 0.5 * 40) / SQRT(40 * 0.5 * 0.5) = 2.85

Using Excel’s NORMSDIST() to get the p-value, remembering that NORMSDIST() gives the probability of *z* or less:

p-value = 1 – NORMSDIST(2.85) = 1 – 0.9978 = 0.0022

Compared to the true value we got from the binomial distribution, that’s not bad for an estimate. I mean, 0.0032, 0.0022, whatever. The point is you can’t plausibly say there is no preference for colored punch cards.
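To see both numbers fall out of the machinery, here’s a sketch in Python of the normal approximation next to the true binomial value (`erfc` from the standard library stands in for NORMSDIST; the variable names are mine):

```python
from math import comb, sqrt, erfc

def norm_upper_tail(z):
    """Probability of z or more on the standard normal: 1 - NORMSDIST(z)."""
    return 0.5 * erfc(z / sqrt(2))

n, P, f = 40, 0.5, 29                      # 29 of 40 users chose colored cards
se = sqrt(n * P * (1 - P))                 # exact standard error, no estimate needed
z = (f - P * n) / se                       # observed count in standard-error units
approx = norm_upper_tail(z)                # normal approximation
exact = sum(comb(n, k) * P**k * (1 - P)**(n - k) for k in range(f, n + 1))
print(round(z, 2), round(approx, 4), round(exact, 4))  # 2.85 0.0022 0.0032
```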

But the normal approximation is only an estimate based on the assumption that your sampling distribution (the binomial distribution for your sample size and hypothetical state) is close enough to a normal distribution. It’s the same assumption underlying the t-test, so its accuracy can be foiled by the same combination of things:

- The smaller the sample size the worse the accuracy.
- The more skewed the distribution the worse the accuracy.

For sample size effects, consider the probability of getting 6 or more events out of a sample of 8 with a 50% population rate. The true binomial probability is 0.1445 while the normal approximation estimates it at 0.0786, about half as much. Contrast that with the 0.0032 versus 0.0022 values above, where the estimate is off by about a third.

Skewness is related to *P*, the probability in your hypothetical state. The more *P* deviates from the midpoint of 0.5, the more lopsided the binomial distribution becomes. That makes sense: the lower the probability of each event, the more frequently you’ll get samples with low counts of events–the more they’ll tend to pile up at the low end of the scale. Here, for example, is the binomial distribution for a sample size of 40 when the population rate is 5%:

The opposite holds when the population rate of each event is over 50%.

For example, instead of measuring colored punch card preference, suppose you ran your 40 users through a usability test and counted how many failed to complete the task. You might need to run such a large sample size if the costs of both Type I and Type II errors are so large there’s little gap between Hypothetical States A and B. If Hypothetical State A is 0.05, then the true probability of observing 4 failures or more out of 40 users is 0.1381, but the normal approximation estimates it at 0.0734, again about half as much. Not too good to underestimate your error rate so much when costs are high.

Put both small sample size and skewed distributions together, and you can be way off. Remember that the true probability of 2 or more out of 3 given a 10% population rate is 0.0280? The normal approximation estimates it at 0.000535!
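Both failure cases are easy to reproduce with the same sketch (the helper names are mine; the 6-out-of-8 example uses a 50% hypothetical rate, which reproduces the figures above):

```python
from math import comb, sqrt, erfc

def exact_or_more(f, n, P):
    """True binomial probability of f or more events out of n."""
    return sum(comb(n, k) * P**k * (1 - P)**(n - k) for k in range(f, n + 1))

def normal_or_more(f, n, P):
    """Normal approximation of the same upper-tail probability."""
    z = (f - P * n) / sqrt(n * P * (1 - P))
    return 0.5 * erfc(z / sqrt(2))

# Small sample: 6 or more out of 8 at a 50% rate.
print(round(exact_or_more(6, 8, 0.5), 4), round(normal_or_more(6, 8, 0.5), 4))
# → 0.1445 0.0786

# Small sample *plus* skew: 2 or more out of 3 at a 10% rate.
# The true value is 0.028; the approximation collapses to around 0.0005.
print(round(exact_or_more(2, 3, 0.1), 3), normal_or_more(2, 3, 0.1))
```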

There are some attempts to reduce the inaccuracies of the normal approximation. Some of the inaccuracy can be traced to the counts (*f*) in the binomial distribution being discrete integers while the normal distribution is a smooth continuous function. Yates’ continuity correction seems to me a reasonable and effective way to compensate for this, but apparently it doesn’t work as well as it should when you run it on realistic data, so it’s controversial today.

The real solution is simply don’t use the normal approximation. This isn’t 1978 and we’re not punching cards any more. Just about anyone with any access to today’s technology can calculate the true binomial probabilities.

The only reason I can see to use the normal approximation today is when you’re doing a quick check literally in your head. For example, Jason Cohen has described a simple rule that allows you to decide if an observed count has a p-value less than 0.05 given a population rate of 50%. I’ve personally used the rule that the 95% confidence interval of the observed rate of a binary event is about the same as the inverse of the square root of the sample size (1/SQRT(*n*), so you need a sample size of 100 to get results you’re very sure are within 10% of the true value). I use it to estimate the sample size of poll results on TV when they report the “margin of error.” Another way to wow the stat groupies. Both of these mental tricks use the normal approximation of the binomial distribution (technically, Cohen uses the chi-square distribution with one degree of freedom, but that’s a mathematical function of the normal distribution so it gives the same answer). Go ahead and use them, but realize you can be off by a factor of 2 pretty easily, and that either trick only works well with large sample sizes and population rates not too far from 50%.
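The arithmetic behind that second mental trick is a one-liner; here’s a sketch comparing the 1/SQRT(*n*) rule to the normal-approximation margin of error at a 50% rate (the sample sizes in the loop are just illustrative):

```python
from math import sqrt

# Rule of thumb: the 95% margin of error of an observed rate is roughly 1/sqrt(n).
# The normal approximation at a 50% rate gives 1.96 * sqrt(0.5 * 0.5 / n).
for n in (100, 400, 1000):
    rule_of_thumb = 1 / sqrt(n)
    normal_margin = 1.96 * sqrt(0.5 * 0.5 / n)
    print(n, round(rule_of_thumb, 3), round(normal_margin, 3))

# A sample of 100 gives a margin near 0.10 -the "within 10%" mentioned above.
```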

The real reason I spent so much time on the normal approximation of the binomial distribution is that some apps out there use the normal approximation instead of calculating the true binomial probabilities. They seem to be many of the ones that come up when you google for A-B testing. Before you use a binomial calculator on the web or download a binomial app, check if it’s giving real probabilities or the approximation. If the documentation isn’t clear, test the app with a small sample size and population rate far from 50% and compare the results to what you get with BINOMDIST() or my binomial tables. For example, the true chance of getting three or more out of 10 when the population rate is 10% is 0.0702. The normal approximation will say it’s 0.0175 or maybe 0.0569 if it’s attempting to correct for continuity.
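That sanity check is quick to script; here’s a sketch in Python producing all three values named above (the helper names are mine; `continuity=0.5` mimics a continuity-corrected app):

```python
from math import comb, sqrt, erfc

def exact_or_more(f, n, P):
    """True binomial probability of f or more events out of n."""
    return sum(comb(n, k) * P**k * (1 - P)**(n - k) for k in range(f, n + 1))

def normal_or_more(f, n, P, continuity=0.0):
    """Normal approximation; pass continuity=0.5 for a continuity correction."""
    z = (f - continuity - P * n) / sqrt(n * P * (1 - P))
    return 0.5 * erfc(z / sqrt(2))

# Three or more out of 10 when the population rate is 10%:
print(round(exact_or_more(3, 10, 0.1), 4))        # 0.0702: the true probability
print(round(normal_or_more(3, 10, 0.1), 4))       # 0.0175: the plain approximation
print(round(normal_or_more(3, 10, 0.1, 0.5), 4))  # 0.0569: with continuity correction
```

An app that reports the first number is computing real binomial probabilities; one that reports either of the other two is approximating.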

Another pitfall I’ve seen in some on-line apps is to fail to distinguish between one-tail and two-tailed testing. Like a t-test, a binomial test can be used to calculate the probability for a one-tailed or two-tailed test. In a one-tailed test, you calculate the probability of getting either an observed count or more, or an observed count or less, but not both at once. So far, all tests in this post and in Stat 101 were one-tailed tests. For instance, in our colored punch card test, we only considered the possibility of users preferring colored cards over plain cards. We weren’t interested in the possibility that users prefer plain cards over colored cards. As far as we were concerned, that wouldn’t be any different than no preference for either cards.

In a two-tailed test, you calculate the probability of getting an observed count *as extreme or more extreme*: you determine the chance of deviating both ways away from your hypothetical state at the same time. This is the procedure you use if you’re interested in the possibility of plain cards being preferred over colored, where that would impact your UI design (or punch card policy, in this case). For example, suppose plain and colored cards cost about the same, or suppose that, while usually plain cards are cheaper than colored cards, sometimes colored cards are cheaper than plain cards. Now it makes a difference whether users prefer plain cards versus users not caring. If users don’t care, you’d buy whatever cards are cheaper at the moment, but if users prefer plain cards, you’d buy plain cards, favoring them even if they’re more expensive. Or maybe there’s an important lesson to learn if users prefer plain cards over colored, something about users being distracted or annoyed by gratuitous attempts to improve aesthetics (although there are practical reasons to use colored punch cards too). All these are reasons to do a two-tailed test: to calculate the probability of getting a count as extreme or more extreme than what you’ve observed in your sample.

If your Hypothetical State A is exactly 50%, then the binomial probabilities are symmetrical, and you get the two-tailed probability for both Hypothetical State A and B the same way you do with a t-test -simply double your one-tailed probability. In our punch card example, the two-tailed probability of getting 29 out of 40 is 0.0032 * 2 = 0.0064, which is actually the probability of getting 29 or more *or* 11 or less.

When Hypothetical State A isn’t exactly 50%, then the binomial probabilities are no longer symmetrical, and the probability of one tail isn’t quite equal to the probability of the other. The most common procedure for getting the two-tailed value is the Method of Small p-values, which, in a nutshell, means finding the precise count of events in the other tail that have a probability equal to or less than the probability of precisely the observed count, then sum the probabilities from the count you found to the end of the tail. That can mean some hunting around if you have to do it manually with a one-tailed app or function like BINOMDIST(). Fortunately, rarely in usability testing do you ever have a two-tailed test with a Hypothetical State A other than 50%.
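The hunting around can be automated; here’s a sketch of the Method of Small p-values in Python (the function names are mine; the epsilon guards against floating-point round-off when comparing probabilities):

```python
from math import comb

def binom_pmf(k, n, P):
    """Probability of precisely k events out of n."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

def two_tailed_p(f, n, P):
    """Method of Small p-values: sum the probability of every outcome whose
    precise probability is no larger than that of the observed count f."""
    cutoff = binom_pmf(f, n, P) * (1 + 1e-9)  # tolerance for round-off
    return sum(binom_pmf(k, n, P) for k in range(n + 1)
               if binom_pmf(k, n, P) <= cutoff)

# At exactly 50% the distribution is symmetrical, so this just doubles one tail:
print(round(two_tailed_p(29, 40, 0.5), 4))  # 0.0064, i.e., 0.0032 * 2
```

With a Hypothetical State A other than 50%, the same function finds the uneven opposite tail for you.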

But the real lesson from this is that you need to check if your binomial app or function is giving you one or two-tailed probabilities. Given that a two-tailed probability is about twice the one-tailed, it makes a big difference. Ideally, the app should give you the choice of one-tailed or two-tailed results since you may be using either in a usability test. If the app doesn’t say what it does, you can test it by comparing results with Excel’s BINOMDIST(), which gives one-tailed probabilities of an observed count or less.

Meanwhile, back in your time warp, you’ve found another opportunity to improve the user experience of your hapless keypunching users. You noticed that a thick cable crosses the route to the card reader in the I/O room. Occasionally, users stumble on it, and then some dump their carefully ordered stacks of punch cards on the floor.

You mention this issue to Sven the BOFH, but he’s skeptical. He points out that this is 1978 and all computer users tend towards the geeky side of the population distribution; they’re not exactly known for their physical coordination, so who’s to say they wouldn’t stumble and drop their cards anyway? (You admit you have witnessed that some users are apparently capable of stumbling while standing still waiting for the card reader to read their cards.) Indeed, Sven argues that the big fat black cable is a feature, not a bug: it’s such an obvious tripping hazard that it encourages the geeks to walk more carefully, reducing the frequency of stumbles from what it otherwise would be. Besides, the only alternative is to route users through a different door to the I/O room, but that’s a somewhat longer walk, which increases the chance for stumbles.

You can forgive Sven for being obtuse about modern usability/human factors. It’s only with today’s UX enlightenment that you never see the virtual equivalent of a big-ass tripping hazard on a web site. Besides, to his credit, the BOFH is open-minded -anything to improve the lot of his users. He’s willing to do an A-B test. You sit outside the I/O room and, by flipping a coin, direct each user to one route or the other to the card reader, and count the number of users that stumble as they pass you. Like before, you exclude users you’ve already observed once to maintain independence of your observations.

It’s important to know if the short way is better than the long way (as Sven theorizes) in addition to the long way being better than the short way (your theory), so this will be a two-tailed test. It’s safe to say that any change that reduces any stumbling and card dumping is worth considering, so that implies a Hypothetical State A of 50%: an equal number of users will stumble whichever way they go. For Hypothetical State B, you and Sven agree that a one-third reduction in stumbling rates is definitely worth favoring one route over the other (e.g., of the users that stumble, 40% stumble on one route, and 60% stumble on the other).

Of the 100 different users you observed, 19 stumbled on the way to the card reader, a 19% rate comparable to the percent of users that have a problem in a usability test of a modern interactive application. But of more interest to you, of those 19 users, 13 stumbled taking the short route over the cable, but only 6 stumbled taking the long route through the alternative entrance. In other words, in your sample, taking the long route reduced stumbling by more than a half. So you want the probability of 6 or fewer users out of 19 stumbling, assuming 50% stumble in the population.

p(6 or less out of 19 given 50%) = BINOMDIST(6,19,0.5,TRUE) = 0.0835

This is a two-tailed test, so the p-value for observing something as extreme as 6-or-less out of 19 is 0.0835 * 2 = 0.1671.

There’s a pretty reasonable chance of seeing results like this when the true population rate is 50%. You don’t have to calculate the p-value for Hypothetical State B, which stipulates only a one-third reduction. Given your observed one-half reduction, you’re not in the Whatever Zone where it seems like it makes little difference which route your users take. You start planning to run more users.

Then the ever-helpful BOFH notices something in your data: of the 100 users you observed, 45 were sent the short way over the cable, and 55 went the long way. There’s no good reason to think there’s something wrong with the coin you were tossing. If you calculate the binomial probability for it, you’ll see it’s quite plausible to get deviations at least that large with a fair coin. But nonetheless you happened to send fewer users over the cable than around it. Sven reasons that if the route really makes no difference, then since you sent 55 of *all* 100 users the long way, 55%, not 50%, of your 19 *stumbling* users should have gone the long way.

Sven is right. You’re using the wrong population rate for Hypothetical State A. And you’re not the only one. Your hypothetical rates need to reflect the base or marginal exposure rates. If there is no effect, then the population rate of a condition is equal to the proportion of users exposed to that condition. Failing to correct for the base rates (for the number of users in each condition of an A-B test) can give you p-values that are bigger or smaller than they should be, depending on which condition gets the greater number of users. Failure to take into account the base rate is a common error not only among students of statistics but in less mathematically formal problems in our lives.

It’s easy enough to take the base rate into account with binomial probabilities. In this case, you sent more than half the users on the long route, which means you accentuated the exposure to stumbling on the long route, so the p-value you calculated is too big. For the number of users stumbling on the long route, Hypothetical State A is 55%, not 50%. Redoing the binomial probability:

p(6 or less out of 19 given 55%) = BINOMDIST(6,19,0.55,TRUE) = 0.0342

The p-value is down on one tail, as expected. But now you’re pissed, because that stupid blogger told you you wouldn’t have to do two-tailed binomial tests in usability for anything other than a 50% population rate, and here it is, only a few paragraphs later, and you have to do precisely that.

Trust me here a minute.

If you were to follow the Method of Small p-values, you’d find the opposite extreme of 6-or-less out of 19 is 15-or-more out of 19 (the probability of precisely 15 out of 19 is 0.0203, just under the probability of precisely 6 of 19, which is 0.0233). So the other tail’s probability is:

p(15 or more out of 19 given 55%) = 1 – BINOMDIST(14,19,0.55,TRUE) = 0.0280.

So, the two-tailed p-value for seeing something as extreme as 6-or-less is 0.0342 + 0.0280 = 0.0622. Thank you, BOFH. That may look to you like an acceptable Type I error rate right there, leading you to conclude there’d be less stumbling if users always took the long route. However, it’s not good enough for Sven. Most of his users have scientific and mathematical backgrounds. They’re not going to be happy with him blocking the short route unless he can put the stamp of Statistically Significant on the results. Sven wants to see the p-value go below 0.05. So you’re back to running more users, although it’s not as bad as it was before.

While you were doing all these calculations by hand, Sven whipped up a Cobol BINOMDIST subroutine, and is enjoying printing out his own tables of binomial probabilities. He confirms your 0.0622 two-tailed probability for 6-or-fewer out of 19. Now he decides to look at those who took the short route, of whom 13 out of 19 stumbled, and the base exposure rate was 45%.

p(13 or more out of 19 given 45%) = 1 – BINOMDIST(12,19,0.45,TRUE) = 0.0342

That’s a familiar looking number. And sure enough, using the Method of Small p-values, he finds the other extreme is 4 or fewer:

p(4 or less out of 19 given 45%) = BINOMDIST(4,19,0.45,TRUE) = 0.0280

An exact mirror image of your results, with the same total p-value. Of course. Logically, they should have the same probability since you can’t have 6 out of 19 stumbling on one route without also having 13 out of 19 stumbling on the other route. 6-or-less out of 19 given 55% is synonymous with 13-or-more out of 19 given 45%. The same situation must have the same probability no matter how you state the problem. Math is truth and beauty.

Just for fun, Sven now calculates the probability of observing the users *not* stumbling when they take the long route. Nineteen of your 100 users stumbled, which means 81 didn’t stumble. You sent 55 users the long way, of whom 6 stumbled, so 49 didn’t stumble. Logically, the probability of 49-or-more out of 81 should be the same as 6-or-fewer out of the 19, since they are simple restatements of the same situation.

p(49 or more out of 81 given 55%) = 1 – BINOMDIST(48,81,0.55,TRUE) = 0.1891

Wait a minute. Sven has only done one tail so far, and his p-value is already far larger than what you got for the users who stumbled. Following the Method of Small p-values, Sven finds the other tail starts at 40.

p(40 or less out of 81 given 55%) = BINOMDIST(40,81,0.55,TRUE) = 0.1827

Total p-value = 0.1891 + 0.1827 = 0.3718
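Sven has his COBOL; the same double-check in Python is a short sketch like this (binom_cdf is my stand-in for BINOMDIST with the TRUE flag):

```python
from math import comb

def binom_cdf(k, n, p):
    # P(X <= k), the equivalent of Excel's BINOMDIST(k, n, p, TRUE)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

upper = 1 - binom_cdf(48, 81, 0.55)  # 49 or more of 81 non-stumblers on the long route
lower = binom_cdf(40, 81, 0.55)      # 40 or fewer of 81 non-stumblers
print(round(upper, 4), round(lower, 4), round(upper + lower, 4))
```

Run it and you get the same 0.1891, 0.1827, and 0.3718 as above.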

Two very different p-values for the same situation. One of them has to be wrong. Do you go with the users who stumbled or those who didn’t? Which is right, 0.0622 or 0.3718? Why?

The answer, of course, is that neither 0.0622 nor 0.3718 is right. The problem with both p-values is that they use only part of the information you have available. What you need is a procedure that calculates the probability of the entire situation, including both those who stumbled and those who didn’t. You want to include all the information in this 2-by-2 cross-tabulation table:

| Response | Short Route | Long Route |
| --- | --- | --- |
| Stumble | 13 | 6 |
| Not Stumble | 32 | 49 |

Looking at it this way, you see now that you really have *two* binary variables in your usability test: two separate sets of categories to classify your users. One is the route you assigned each user (long or short) and the other is the *response* of the user (stumble or not). The binomial test is inadequate for this situation because it handles only *one* binary variable. As is often the case when you use less than all the information available, you lose statistical power or “decision potential.” The p-values both you and Sven calculated are larger than they would be if you used the information from both binary variables at once.

This appears to be a common error in A-B testing. I can only guess they get away with it because the error in the p-value is small when the total number of conversions is much smaller than the total number of non-conversions, which is pretty typical. But why be even a little wrong? Use your computer to compute. Use all the information you have.

The procedure to calculate a p-value for the joint counts from two binary variables is Fisher’s Exact Test (not to be confused with Fisher’s F or Fisher’s Z; that guy Fisher did a lot of great stuff; probably had a lot of groupies too). While the binomial test calculates the probability of observing two mutually exclusive counts (failures and non-failures), Fisher’s Exact Test calculates the probability of four mutually exclusive counts from cross-tabulating two binary variables with each other. Like the binomial test, you first calculate the probability of precisely the observed configuration of counts, and then you add to that all other more extreme configurations for either one tail or two tails, depending on your intentions with the usability test.

Unlike a binomial test, however, Fisher’s Exact tests only one hypothetical state: that the proportions in each cell match the base rates. To put it another way, the hypothetical state is that there is no correlation between the two variables that characterize your users: for example, that the tendency to stumble is unrelated to the route a user takes. To put it yet another way, the hypothetical state is that the two binary variables are *independent*. By “independent” I mean the same thing we mean when we talk about users performing independently. When users are independent, the chance of one user failing or succeeding is unrelated to whether another fails or succeeds. When binary variables are independent, the chance of a user falling in a particular category of one variable is unrelated to his or her category on the other. The chance of stumbling is no different whatever route the user takes, for example.

A state of uncorrelated or independent variables is usually what you want for Hypothetical State A. It’s what we meant to have for this case of alternative routes to the card reader. However, since that’s the only hypothetical state that Fisher’s Exact tests, you cannot test Hypothetical State B with Fisher’s Exact. At least, I haven’t figured out a way yet. Like I said, statistical analysis of counts is surprisingly complicated.

To calculate Fisher’s Exact, first figure your row and column totals.

| Response | Short Route | Long Route | Total |
| --- | --- | --- | --- |
| Stumble | 13 | 6 | 19 |
| Not Stumble | 32 | 49 | 81 |
| Total | 45 | 55 | 100 |

Now consider the users you sent on the long route. Of the 100 users, how many different ways are there to divide them up so that 55 go the long way and 45 go the short way? To answer that, we use the same formula for combinations that we used to calculate binomial probabilities: number of combinations = *n*! / (*f*! * (*n* – *f*)!). From now on, I’ll abbreviate that formula as COMBIN(*n*, *f*), which also happens to be the Excel function for calculating combinations. You can use either 45 or 55 for *f* since it means the same thing and gives the same answer. That is, if there are *x* ways to get 45 out of 100 on the short route, there must also be the same *x* ways to get 55 out of 100 on the long route, since every way of getting 45 on the short route is also a way of getting 55 on the long route.

So, the answer to the question is:

COMBIN(100,55) = 6.1448e28 (i.e., a really big number)

Now of 19 total users that stumbled, how many different ways are there to split them into 6 of them taking the long route and 13 of them taking the short route?

COMBIN(19,6) = 27,132 (again, it doesn’t matter if I use 6 or 13, but I’ll use 6 to be consistent)

And of the 81 that didn’t stumble, how many ways are there to get 49 of them in the Long Route cell?

COMBIN(81,49) = 3.6219e22 (lots smaller than the first number, but still pretty impressive)

For every one of those 27,132 ways of filling the stumbler row, there are 3.6219e22 ways to fill the non-stumbler row, so you get the total number of different ways of filling both rows to arrive at the observed counts by multiplying the two together:

27,132 * 3.6219e22 = 9.8269e26

Divide that by the number of ways of getting 45 and 55 total users on the short and long way, and that’s the proportion of ways to get the observed counts in each row given the totals of each column. It’s therefore the probability of getting the observed counts.

p = 9.8269e26 / 6.1448e28 = 0.01599

That’s the probability of the precise configuration of counts you observed.
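In Python, for instance, the whole chain of combinations takes a couple of lines with math.comb, the equivalent of COMBIN (a sketch; the variable names are mine):

```python
from math import comb

ways_routes       = comb(100, 55)  # ways to send 55 of 100 users the long way
ways_stumblers    = comb(19, 6)    # ways for 6 of the 19 stumblers to be on the long route
ways_nonstumblers = comb(81, 49)   # ways for 49 of the 81 non-stumblers to be on it
p_table = ways_stumblers * ways_nonstumblers / ways_routes
print(ways_stumblers)    # → 27132
print(round(p_table, 5)) # ≈ 0.01599, matching the hand calculation above
```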

As a general formula, the probability of a particular two-by-two crosstabulation table of counts is:

p = COMBIN(*tr*1, *f*1) * COMBIN(*tr*2, *f*2) / COMBIN(*t*, *tc*)

Where:

- *tr*1 is the total for the first row.
- *tr*2 is the total for the second row.
- *f*1 is the count from the first row in one column.
- *f*2 is the count from the second row in the same column.
- *tc* is the total of that column.
- *t* is the total sample size.

That only gets you the p-value of precisely the observed configuration of observations. Now, similar to what you do with a binomial test, you need to calculate the p-value for each way of having a stronger relationship between stumbling and route while keeping the column and row totals the same. For example:

| Response | Short Route | Long Route | Total |
| --- | --- | --- | --- |
| Stumble | 14 | 5 | 19 |
| Not Stumble | 31 | 50 | 81 |
| Total | 45 | 55 | 100 |

And:

| Response | Short Route | Long Route | Total |
| --- | --- | --- | --- |
| Stumble | 15 | 4 | 19 |
| Not Stumble | 30 | 51 | 81 |
| Total | 45 | 55 | 100 |

And so on.

If you’re doing a two-tailed test, then you also have to use the Method of Small p-values to find the corresponding configurations in the opposite tail, where there is proportionally more stumbling on the long route than the short route, and add in their p-values too.

For our stumbling data, that all sums up to a p-value of 0.0386. How do you like that? All this time your results met Sven’s requirement for statistical significance. It was just a matter of finding the right test.
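To see all of that in one place, here’s a bare-bones sketch of a two-tailed Fisher’s Exact Test in Python (my own toy implementation, not a library routine; a real app will be better tested):

```python
from math import comb

def table_prob(a, b, c, d):
    # Probability of the 2-by-2 table [[a, b], [c, d]] given its margins
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_two_tailed(a, b, c, d):
    # Method of Small p-values: sum the probability of every table with
    # the same margins that is no more likely than the observed table
    row1, col1, n = a + b, a + c, a + b + c + d
    p_obs = table_prob(a, b, c, d)
    total = 0.0
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        p = table_prob(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if p <= p_obs + 1e-12:
            total += p
    return total

# Stumble: 13 short, 6 long; Not Stumble: 32 short, 49 long
p_value = fisher_two_tailed(13, 6, 32, 49)
print(round(p_value, 4))  # → 0.0386
```

The loop simply walks through every configuration of counts that keeps the same row and column totals, which is exactly the enumeration described above.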

I went through all that math just so you know what Fisher’s Exact is up to under the hood. In practice, you use an on-line or mobile app. I can recommend the one at GraphPad Software and the one by Øyvind Langsrud. Unfortunately, Excel doesn’t have a Fisher’s Exact function.

Just like you can use the normal approximation of the binomial distribution to estimate the p-value from the binomial test, you can also use a normal approximation to estimate the p-value for the relationship between two categorical variables. That approximation is the *chi-square* (not “chi-squared”) test. It uses the chi-square distribution, which is sort of a normal distribution for multiple variables combined.

In general, a chi-square test is useful for testing how well some observed values fit expectations. To use chi-square to test the independence of two categorical variables, we calculate the counts we would *expect* to get in each cell if the hypothetical state of independence were true. For example, overall 19/100 or 19% of your users stumbled on the way to the card reader. If stumbling is independent of route, then you would expect that of the 55 users who took the long route, 19%, or 10.45 users, would stumble (on average).

Expected count of stumbling users on long route = 55 * 19/100 = 10.45

Likewise, the expected number of users stumbling on the short route is the total proportion of users stumbling times the number who took the short route:

Expected count of stumbling users on the short route = 45 * 19/100 = 8.55

Now do the same calculations for those *not* stumbling. The expected number of users not stumbling on the long route is the total proportion of users not stumbling times the number of users on the long route:

Expected count of non-stumbling users on long route = 55 * 81/100 = 44.55

And likewise for the short route:

Expected count of non-stumbling users on the short route = 45 * 81/100 = 36.45

So you see a pattern here. For any cross-tabulation table of two categorical variables, if the variables are independent (uncorrelated), then the expected count *fe* for a cell in the *r*th row and *c*th column is:

*fe* for cells in row *r* and column *c* = *tc* * *tr* / *t*

Where *tc* is the total for column *c*, *tr* is the total for row *r*, and *t* is the grand total (your sample size).
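A quick check of that formula, sketched in Python (the labels are mine):

```python
row_totals = {"Stumble": 19, "Not Stumble": 81}
col_totals = {"Short": 45, "Long": 55}
n = 100

# Expected count under independence: row total * column total / grand total
for response, tr in row_totals.items():
    for route, tc in col_totals.items():
        print(response, route, tr * tc / n)
```

It prints the same 8.55, 10.45, 36.45, and 44.55 as above.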

To get the chi-square statistic, calculate the following for each cell:

2 * *f* * LN( *f* / *fe* )

Where *f* is the count you observed in the cell and LN() is the natural logarithm function. This form of calculating chi-square is known as the “G-test” to distinguish it from an older somewhat less accurate calculation. While the G-test is more accurate than the alternative, it is still using a normal approximation. For our data we get for each cell:

2 * 6 * LN(6 / 10.45) = -6.66

2 * 13 * LN(13 / 8.55) = 10.89

2 * 49 * LN(49 / 44.55) = 9.33

2 * 32 * LN(32 / 36.45) = -8.33

The older alternative calculation is suitable when you have no spreadsheet software or scientific calculator. It’s:

(*f* – *fe*)^2 / *fe*

Or:

(6 – 10.45)^2 / 10.45 = 1.89

(13 – 8.55)^2 / 8.55 = 2.32

(49 – 44.55)^2 / 44.55 = 0.44

(32 – 36.45)^2 / 36.45 = 0.54

The chi-square statistic is the sum of these numbers:

Chi-square = -6.66 + 10.89 + 9.33 – 8.33 = 5.23

Or, doing it the old-fashioned way:

Chi-square = 1.89 + 2.32 + 0.44 + 0.54 = 5.20

Doesn’t make much difference, especially since it’s all an approximation anyway.

The p-value of a chi-square can be found with the CHIDIST() function in Excel. The degrees of freedom for a cross-tabulation of two binary variables is 1.

*p* = CHIDIST(5.23,1) = 0.0222.

*p* = CHIDIST(5.20,1) = 0.0226.
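Here’s a sketch of both calculations in Python. It leans on the fact that a chi-square with 1 degree of freedom is a squared standard normal, so CHIDIST(x, 1) works out to erfc(sqrt(x/2)):

```python
from math import log, sqrt, erfc

observed = [6, 13, 49, 32]
expected = [10.45, 8.55, 44.55, 36.45]

# G-test statistic: sum of 2 * f * ln(f / fe) over the four cells
g = sum(2 * f * log(f / fe) for f, fe in zip(observed, expected))

# Older Pearson statistic: sum of (f - fe)^2 / fe
pearson = sum((f - fe)**2 / fe for f, fe in zip(observed, expected))

def chidist_1df(x):
    # p-value for a chi-square statistic with 1 degree of freedom,
    # the same thing Excel's CHIDIST(x, 1) returns
    return erfc(sqrt(x / 2))

print(round(g, 2), round(chidist_1df(g), 4))              # → 5.23 0.0222
print(round(pearson, 2), round(chidist_1df(pearson), 4))  # → 5.2 0.0226
```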

The chi-square test is always two-tailed, so that’s your final p-value. As you can see, it’s only approximately equal to the true p-value of 0.0386 that we got with Fisher’s Exact. Like the normal approximation of the binomial distribution, the larger the sample size and the more evenly the counts are distributed (i.e., the closer the row and column totals are to each other), the more accurate it is. In the case of our stumbling data, the sample size is pretty big, but the row totals are very different from each other, which accounts for the inaccuracy. A small count in one or more cells of the cross-tabulation table, like the 6 stumblers on the long route, is a warning sign to avoid chi-square.

Also like the normal approximation of the binomial, there’s no particularly good reason to use the chi-square test these days when you can use Fisher’s Exact. The main reasons I’m covering chi-square are the same reasons I covered the normal approximation of the binomial:

- It’s handy if for some reason you don’t have access to an app for Fisher’s Exact. The formula is sufficiently simple that you can whip it up in a spreadsheet pretty quickly, much easier than doing Fisher’s Exact in a spreadsheet.
- Some on-line apps for calculating the p-value for the relation between two binary variables use chi-square rather than Fisher’s Exact, so you need to be aware when you’re only getting an estimate.

There is one other reason to know about the chi-square test: it easily scales beyond binary variables. For instance, you can use it with three or more different designs (e.g., three different ways of warning users about the cable), or three or more different user response categories (e.g., stumble, not stumble, and turn around and never reach the card reader). You merely add rows and columns of count data to your cross-tabulation table. You still calculate *fe* the same way for each cell, and still sum up [2 * *f* * LN( *f* / *fe* )] for all your cells to get your chi-square statistic. You’ll just be doing it to more than four cells.

The only change is to your degrees of freedom. For any cross-tabulation table, the degrees of freedom for a chi-square test is:

*df* = (*r* – 1) * (*c* – 1)

Where *r* is the number of rows and *c* is the number of columns in your table. For two binary variables:

*df* = (2 – 1) * (2 – 1) = 1

For a usability test with four different designs and three different user response categories:

*df* = (4 – 1) * (3 – 1) = 6

The chi-square test is sufficiently simple that you can use it for such elaborate usability tests. Fisher’s Exact, in contrast, gets very complicated very quickly when you scale it beyond binary variables. You wouldn’t want to do it by hand, even in a spreadsheet.
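Sketched in Python, the scaled-up calculation looks like this (the function is mine; it returns the G statistic and degrees of freedom, which you’d then hand to something like CHIDIST):

```python
from math import log

def g_test(table):
    # table is a list of rows of observed counts, any number of rows and columns
    rows, cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[r][c] for r in range(rows)) for c in range(cols)]
    n = sum(row_totals)
    g = 0.0
    for r in range(rows):
        for c in range(cols):
            fe = row_totals[r] * col_totals[c] / n  # expected count
            g += 2 * table[r][c] * log(table[r][c] / fe)
    df = (rows - 1) * (cols - 1)
    return g, df

# The 2-by-2 stumbling data reproduces the statistic from above
g, df = g_test([[13, 6], [32, 49]])
print(round(g, 2), df)  # → 5.23 1
```

Add more rows or columns to the table and the same function handles them.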

A single-table chi-square analysis is the appropriate way to do so-called multivariate testing, rather than breaking the test down into a series of 2-by-2 A-B tests. Consider: if your comfort level for a Type I error is 0.05, and you run 6 separate A-B tests, what’s the chance of at least one leading to a design decision when there really is no difference in conversion rates among any of them? Well, that’s just another binomial probability, sort of a meta-binomial probability, but it’s the same calculation:

p(1 or more, given a 5% rate) = 1 – BINOMDIST(0,6,0.05,TRUE) = 0.265

That is, your actual chance of making a Type I Error is better than 1 in 4, more than five times greater than if you did it all as one 2-by-6 chi-square. If the p-value for the 2-by-6 chi-square is low enough for you, indicating some kind of correlation between designs and responses, *then* you can do a series of 2-by-2 tests (using Fisher’s Exact) in order to check which design is apparently better than the others.
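That inflation is simple enough to check directly (a sketch; 0.95 is the chance a single test avoids a Type I error):

```python
alpha, n_tests = 0.05, 6

# Chance that at least one of six independent A-B tests comes up
# "significant" when no real differences exist
p_any_false_alarm = 1 - (1 - alpha)**n_tests
print(round(p_any_false_alarm, 3))  # → 0.265
```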

You might be wondering if all this is just academic. Binomials, chi-square, Fisher’s Exact, they all pretty much say the same thing. Let’s return to the results at DIY Themes, where the Visual Website Optimizer calculator provided a p-value of 0.051 for getting 8 conversions out of 793 on one design versus 15 out of 736 on the other.

Dig into the JavaScript for the calculator and you discover:

- It performs a one-tailed test. That’s wrong because the very fact that people are debating the counter-intuitive finding tells me everyone has an interest in either design out-performing the other.
- It uses the normal approximation, rather than exact probabilities. That’s very suspect with some cells having such low counts.
- It estimates the binomial probability for just the number of conversions, ignoring the information supplied by users that did *not* convert.

Plug the same numbers into a Fisher’s Exact two-tailed test and you discover the true p-value is 0.14, almost three times greater than what Visual Website Optimizer reported. There’s a one in seven chance of seeing results like this when there is actually no difference between the designs. Personally, that’s way too great a chance of error for me to waste time trying to understand why this form of social proof allegedly backfired. Call me when there’s more data.

**Potential Solution**:

- Use the binomial probability distribution for independent counts in two categories.
- For preference counts of a binary choice, set Hypothetical State A to 50%.
- Select the appropriate number of tails.
- Check the number of tails your binomial app or function uses.

- Avoid using the normal approximation.

- Use Fisher’s Exact test for A-B test results.
- Select the appropriate number of tails and be sure your app supports it.

- Avoid tests that ignore the base rate.
- Avoid tests that ignore the number of non-conversions.
- Avoid tests that use a normal approximation for a simple 2-by-2 cross-tabulation, including chi-square and the G-test.

- Use Chi-square tests of independence (preferably the G-test) when testing more than two designs.
- Test all designs simultaneously in a single large cross-tabulation table.
- If the single large test indicates a relationship, then do 2-by-2 tests to compare designs with each other.

Here’s a spreadsheet of this post’s data and analyses.

- Comparing two new proposed designs.
- Two-tailed versus one-tailed t-tests.
- Maximizing power with experimental design.
- Within-subject testing.
- More on the normality assumption.
- Data transformations.


**Prerequisites**: Stat 101, Stat 201, mild mad scientist tendencies.

Yes! You got the massive Surfer Stan web site contract. Now you’re truckin’.

The biggest challenge for the site redesign is Paint Your Board, a new feature for Stan’s site. Stan has invested in a computer numerically controlled (CNC) four-color surfboard painter. Stan got it so users can upload any image they want to appear on the boards they’ve chosen for purchase. You’ve made two interactive prototype UIs for Paint Your Board. One is an easy-to-understand but awkward-to-use multi-page HTML wizard, where users select the image’s size, position, orientation, and distortion on their boards through a combination of input fields and Submit buttons. Early testing showed this requires a lot of trial-and-error and page loads to get the results the users desire. It’s all very time-consuming.

The second prototype is a Flash application where a user drags and drops the chosen image over the board and sees the result instantly. In theory, this will be several times faster than the HTML approach (not to mention way rad). However, you know that drag and drop isn’t terribly discoverable, so the question is, will the improved speed of Flash make up for the time it takes to learn it? Time for another usability test. This time the question is: which is faster on average, HTML or Flash?

Let’s select our Hypothetical State A. What true difference in times is definitely *not* worth favoring one design over another? This is a different situation than last time, when we compared an existing home page with a proposed new one. Before, you already had a working web site, and were trying to see if it was worth the cost of a redesign. Now both designs are equally complete and at least partially operational (to do the usability test, you had to prototype both designs). However hard it was to make each version, that money and time is already spent. Assuming the future differences in development, installation, and maintenance cost are negligible, there is no additional cost associated with going with one design over the other. So *any* difference is worth favoring one design over another. You have to choose one of them. If you can conclude that one is even a tiny smidgen better than the other, then that’s the one to go for. So:

Hypothetical State A: The difference in the completion times is zero.

If we observe a difference that is implausible assuming the real difference is zero then we can confidently favor one design over the other. Significantly (no pun intended), we will favor one design over the other for *any* implausible difference, either HTML being better than Flash or vice versa.

Last time with the home page designs, we only considered if the new home page was better than the old home page. We never checked to see if the old design was sufficiently better than the new design. We calculated p-values for Hypothetical State B to see how sure we could be that the new design was not considerably better than the old design, but that didn’t tell us if the old design could be considerably better than the new. It made sense to only consider the new being better than the old because it was really irrelevant if the old design is better than the new one. I mean, who cares? As long as you’re reasonably sure the new design wasn’t considerably better than the old, you were not going to go with the new design. Concluding that the old design is better than the new didn’t add useful information to your decision-making process.

Now we are considering *either* difference, simultaneously both HTML being better than Flash and Flash being better than HTML. When we calculate the p-value, it’s not a question of the probability of getting an observed difference so far above Hypothetical State A. It’s not a question of probability of getting an observed difference so far below Hypothetical State A. We need the probability of getting an observed difference so far from *either* side of Hypothetical State A. We need the probability of a difference in completion times as *extreme or more extreme* as what we observe.

For example, suppose we get a difference in completion time of 50 seconds. Let’s say the standard error is 25 seconds. Our p-value for that t-statistic should represent the probability of getting both 50 seconds or higher above zero, and 50 seconds or lower below zero (that is, -50 seconds or lower). The t-statistic, which converts our deviation from the hypothetical state from seconds into a number of standard errors, is:

*tstat* = (*diff* – *hypo*) / *se*

Hypothetical State A is zero difference, so the corresponding t-statistics would be:

*tstat* = (50 – 0) / 25 = 2.0 for 50 seconds and higher,

*tstat* = (-50 – 0) / 25 = -2.0 for 50 seconds and lower.

So in units of standard error, we want the p-value of getting 2.0 or higher and -2.0 or lower. The sampling distribution would look like this:

This is called a two-tailed test because there are time-differences in each tail of the sampling distribution that would represent Type I errors we need to include in our probability. We said we’ll favor one design if the difference is implausible assuming Hypothetical State A is true, but that really represents two ways to make a Type I error, two ways to favor a design when there is really no difference, so we have to include the probabilities of both extremes to get the total probability of a Type I error. We do this by simply adding the probabilities from both tails together. We’ll add the probability of getting a difference of 2.0 or more to the probability of getting a difference of -2.0 or less. Or, because the t-distribution is symmetrical, we’ll double the probability of getting 2.0 or more (especially since Excel refuses to take negative t-statistics).

We can get the p-value of our t-statistic using Excel’s TDIST() function, which requires the t-statistic, the degrees of freedom, and the number of tails. Let’s say our degrees of freedom (which are directly related to the sample size) is 10. TDIST(2.0, 10, 1) = 0.037 is the probability in one tail, so the probability in two tails is 0.037 * 2 = 0.073.

Or, we simply tell Excel we want to do a 2-tailed t-test, and Excel will double the p-value for us: TDIST(2.0, 10, 2) = 0.073.
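If you’re curious what TDIST is doing under the hood, here’s a sketch in Python that gets the one-tailed p-value by numerically integrating the t-distribution’s density (the function names are mine; Excel uses a closed-form routine instead):

```python
from math import gamma, sqrt, pi

def t_density(x, df):
    # Density of Student's t distribution with df degrees of freedom
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, steps=20000):
    # One-tailed p-value P(T >= t) by Simpson's rule; the sliver of
    # tail beyond t + 100 is negligible for these degrees of freedom
    hi = t + 100.0
    h = (hi - t) / steps
    s = t_density(t, df) + t_density(hi, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_density(t + i * h, df)
    return s * h / 3

one_tail = t_tail(2.0, 10)
print(round(one_tail, 3), round(2 * one_tail, 3))  # → 0.037 0.073
```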

You may wonder if we even need Hypothetical State B, since two-tailed p-values for Hypothetical State A can tell us both whether we should choose Flash over HTML and whether we should choose HTML over Flash. We still need Hypothetical State B so we can tell with high certainty if there is no considerable difference between Flash and HTML, as opposed to merely failing to *detect* a potentially important difference because the sample size is too small. By “considerable” in this case, I mean so large we should definitely choose one design over the other.

So now you and Stan set Hypothetical State B. What true difference in times is definitely worth favoring one design over another? Talking it over with Stan, you decide a 20% shorter completion time or better. Or to put it another way, if the faster version takes less than 80% of the time of the other, it’s definitely the design to go for. For example, if using one design takes five minutes and the other design takes four minutes.

Stan’s reasoning is that 20% less time to do something is enough to be noticeable, so that’s definitely worth favoring one design over the other. You could’ve agreed on a certain number of minutes or seconds, but, not knowing how long the Paint Your Board task takes (it being a new feature), Stan couldn’t get tight with such numbers. Besides, having Hypothetical State B as a ratio is going to save us some arithmetic complications, as it turns out.

We have two tails for Hypothetical State B too, since we need to calculate the p-value of *any* difference assuming the real difference is 20%; that is, the HTML-over-Flash or Flash-over-HTML time ratio being 80%.

Doing a two-tailed t-test has nothing to do with completion time being our measure. You can do a two-tailed test for ratings or any other measure.

What matters is whether you’re preparing to check for one or two possible outcomes from each hypothetical state. If you are strictly interested in the possibility of Design X being better than Design Y, it’s a one-tailed test. If you are even remotely interested in Design Y being better than Design X in addition to Design X being better than Design Y, then it’s a two-tailed test.

It’s important to firmly decide whether you want to do a one-tailed or two-tailed test because each has advantages and disadvantages, and you can’t change your mind after you see how the users perform.

The advantage of the two-tailed test is that it tells you more information: for Hypothetical State A alone, it’ll tell you if X is better than Y *and* if Y is better than X. The one-tailed procedure can only tell you if X is better than Y. However, the one-tailed test has more power or “decision potential,” as I like to say. You’re focusing all your resources on only one possible outcome per hypothetical state, getting more out of it. Because a one-tailed procedure doesn’t split your Type I and Type II error probabilities each into two pieces, you can make conclusions with less extreme t-statistics. For example, with a one-tailed test you need a t-statistic of about 1.3 (the exact value depending on your degrees of freedom) to have a 0.10 probability of an error. With a two-tailed test, you’d need a t-statistic of about 1.7 to have the same probability.

A less extreme t-statistic can come from a smaller sample size (i.e., a bigger standard error), so you can make decisions with faster and cheaper usability testing.

Given that usability tests already tend to have small sample sizes, the greater decision potential of a one-tailed test is a major advantage. However, you have to decide to use a one-tailed test before you see which design is on average better than the other in your sample. If you wait and see that Flash is faster than HTML, then decide, “well, I’ll do a one-tailed test on Flash being better than HTML,” you’re really doing a two-tailed test with double your apparent Type I error rate, because if HTML came out faster than Flash you would have done precisely the opposite. It’s like betting “heads” if a coin-toss comes up tails, and betting “tails” if it comes up heads. You’re doubling your chance of losing from 0.5 to 1.0.

Likewise, you can’t start out planning a one-tailed test and then switch over to two-tailed once you see the data going opposite your one-tailed test. Given Stan can tolerate a 0.10 chance of an error, switching from a one-tailed to a two-tailed test after seeing the averages means having a 0.10 Type I error rate if the data had gone as expected, plus a 0.05 Type I error rate for how the data actually went, for a 0.15 total Type I error rate. Bummer for Stan.

The bottom line is: use the one-tailed procedure when there is some added cost associated with going with Design X over Design Y, such as we had in the rating scale situation, where going with Design X (the new one) meant Stan spending duckets he wouldn’t have to otherwise. In other words, use a one-tailed test when Design Y being better than Design X is practically equivalent to Design X not being considerably better than Design Y.

It’s a pretty straightforward usability test: get the average difference in the time it takes to virtually paint a board with each prototype. However, you learned your lesson from the test with the ratings. You don’t want to once again be in a situation where both hypothetical states are reasonably plausible given your initial small sample size.

That meant you couldn’t make any design decision with adequate confidence. We want to maximize our decision potential (or statistical power, broadly speaking) to stay out of the statistical mush, that is, being unable to make a decision at acceptable Type I and Type II error rates. For a given usability test, we want to have the lowest Type I and Type II error rates regardless of the difference we see in the sample. Going back through Stat 101 and 201, we’ve already touched on three ways to help decision potential:

**Large sample size.** The larger the sample size, the more accurate the statistics from the sample; that is, the smaller their sampling error. The smaller the sampling error, the higher your confidence that any deviation from either hypothetical state represents the true population values. In fact, as we saw in Stat 201, if we know the averages and standard deviations to expect from the sample, we can calculate the sample size we need to have reasonably good decision potential.

In the case of these completion times, you don’t have estimates of the standard deviations, so you can’t estimate the sample size you’ll need. Earlier testing sessions of previous design iterations included a lot of stopping the task to talk with the users to get qualitative feedback, so it didn’t make sense to try to time the users. But here are some guidelines on choosing a sample size when you don’t have those statistics: sample sizes of 5 to 10 users are generally only adequate when you expect one design to blow the other out of the water, like one design having half the completion time of the other, a result you can see coming in earlier design iterations even allowing for imperfect measurement. However, your sense from this formative testing is that the difference between HTML and Flash is going to be close. Alternatively, small sample sizes are adequate when there is very little difference between the two designs *and* your Hypothetical State B specifies a radical difference in performance. If you’d need one design to take half as long as the other to be definitely worth it, then 5 to 10 users would be a good starting point. You and Stan, however, see 80% as definitely worth it, not that big a difference in completion time. For our example, go big: start with 20 users and be prepared to add more if you get mushed.

**High-powered statistical test**. Some statistical procedures have more decision potential than others. As a general guide, the more information you use from the data and testing process, the greater the decision potential. For comparing averages, the t-test has just about the most decision potential of any test you’re going to do on a spreadsheet, so we’re already doing the best we can in this regard. Simpler alternatives to the t-test use less information in the data. For example, tests like the Mann-Whitney U and Wilcoxon Signed-rank use the rank order of the completion times rather than the actual completion time to the nearest second. Other tests, like the Chi-square or the sign test, simply classify the completion times as relatively “slower” and “faster.” Anytime you reduce your data in this way, you’re losing decision potential. The main reasons to consider such alternatives to the t-test are that the data is already reduced (e.g., that’s how you got it from the client), or that the normality assumption cannot be met.

**One-tailed Test**. As we just covered above, a one-tailed test has greater decision potential than a two-tailed test. However, we’re interested in either Paint Your Board design being faster than the other (or either board not being considerably faster than the other), so we’re doing a two-tailed test, which makes it all the more important to compensate for the loss in decision potential.

Let’s look more closely at the t-test and see if there’s something else we can do. To get high decision potential, you need large t-statistics for either Hypothetical State A or B. The numerator for the t-statistic represents the deviation of the sample from the hypothetical state, and that’s going to be what that’s going to be. The denominator of the t-statistic is the standard error. You’ve got to keep your standard error down. Big sample sizes do that, of course, but they’re expensive and time-consuming, so you don’t want them to be any bigger than they have to be -you remember quite well how you missed a couple beautiful days at the beach because you had to nearly quadruple your sample size for the ratings test in Stat 201. What else can you do?

The only variable other than sample size in the formula for the standard error is the sample standard deviation. You need to make it as small as you can. You want to minimize the variability in the numbers you conduct your t-test on. There are several ways of doing this:

**More similar users**. If all your users had the same level of skill and knowledge, that would reduce variability. This is why scientists do experiments on genetically inbred white rats raised in controlled environments -they’re all very much alike in whatever rat talents and skills they have, which minimizes variability, allowing smaller sample sizes. However, limiting the range of skill and knowledge of your users would make your sample unrepresentative of your population of users. Heck, run a bunch of white rats on your Paint Your Board feature, and they’ll all do equally well (poorly), but it won’t mean anything for your users. What you can do is more carefully select your users to make sure you aren’t getting any who are truly unrepresentative of your users. For example, you want to be sure not to include any total kooks who keep confusing the top of the surfboard with the bottom (“which way do the fins go again?”), adding to their task completion time.

**More tightly controlled task**. You get less variability if you focus your usability test task on precisely the thing you’re testing, and exclude all else. For example, the UI to upload the image for the board is the same for both the HTML and Flash versions, so don’t start the stopwatch until after the users upload their image. Otherwise, users who get lost in their own file hierarchies have slow completion times relative to those who rip through their folders. Including the image-uploading won’t change the difference in the average completion times because by random chance you can expect equal numbers of hierarchy-lost users in each group, so they cancel out. However, it increases the variability of the completion times. Likewise, you want to eliminate other sources of extraneous variation. Make sure your instructions are clear and consistent for all users so you don’t have any taking unusually long times because they misunderstood something. Make sure there aren’t any interruptions once the task is started. Keep the users on task no matter how much they want to wander off and explore other features of the site.

**Within Subjects Testing**. You substantially reduce variability by having each user try *both* designs and looking at the change in completion time for *each* user. This is called a “within-subjects” test (or a “repeated measures” experiment) because you’re varying the UI design within each user or “subject.” The alternative is “between-subjects” testing, which is what we had with the ratings in Stat 201. There you varied the UI design between two separate groups of users. Within-subjects testing has the effect of factoring out individual average differences, reducing variability. Users who tend to be fast on both prototypes are no longer so different from users who are slow on both prototypes because we now use the *change* in completion times.

Within-subjects testing is great for increasing decision potential, but it has a couple of disadvantages. First of all, you need more time per user, so you probably have to compensate them more. However, that’s usually cheaper than having to recruit more users, given the overhead of identifying suitable ones. As long as it doesn’t make the usability test so long no one wants to do it, that isn’t a major problem. Usually, you’ll get more decision potential from a within-subjects test of a certain number of users than you would with a between-subjects test of twice as many users. Both approaches have the same total amount of user time, so the compensation cost is about the same, but within-subjects gives you more decision potential for the compensation buck. Thus, even if the only cost were paying users for their time, you’re still usually better off with a within-subjects test.

A more serious problem is order effects. There might be differences in performance merely due to users trying one prototype before the other. For example, if the users try the HTML version first and then the Flash version, they might be faster with Flash simply because they got some practice painting their board, not because Flash is better. Or if the HTML version were faster, then maybe users were getting fatigued or bored by the time they got to paint their board (again) with the Flash version. The solution is to balance out the order effects by randomly assigning exactly half of your users to try the Flash version first while assigning the other half to try the HTML version first. This is called *cross-over* within-subjects testing.

Within-subjects testing should be considered for all usability tests regardless of the measure. We could’ve done it for the test comparing ratings of the new and old home page (*now* I tell you). That not only would’ve lowered the standard error, but also would’ve encouraged users to make contrasting ratings for the two versions if they believed one was better than the other. As always, you want to balance out the order effects by having exactly half your users see one version first while the other half sees the other first. Who knows? There may be primacy preference effects where users tend to like the first thing they see the most. Or maybe recency effects where they assume the second thing they see must be “improved.” A cross-over test eliminates such concerns, pre-empting any tedious (but legitimate) arguments over data interpretation.

The only time not to use within-subjects testing is when you’re likely to get extreme order effects -when doing the task the first time effectively negates the meaning of doing the task again. For example, say you’re testing two menu designs to see how they communicate a web site’s information architecture. Users are going to at least partially learn the IA with the first menu design they use, so they’ll be much better when they try the second one. In fact, there may not be much new learning of the IA at all, so it’s no longer reasonable to compare the two.

Finally we get to run the usability test. Here’s your data:

| User | HTML (sec) | Flash (sec) | Difference (sec) |
| --- | --- | --- | --- |
| 1 | 209 | 280 | 71 |
| 2 | 272 | 209 | -63 |
| 3 | 158 | 155 | -3 |
| 4 | 392 | 305 | -87 |
| 5 | 392 | 448 | 56 |
| 7 | 170 | 305 | 135 |
| 8 | 289 | 251 | -38 |
| 9 | 479 | 600 | 121 |
| 10 | 258 | 140 | -118 |
| 11 | 600 | 502 | -98 |
| 12 | 245 | 247 | 2 |
| 13 | 265 | 196 | -69 |
| 14 | 182 | 177 | -5 |
| 15 | 296 | 239 | -57 |
| 16 | 318 | 231 | -87 |
| 18 | 303 | 238 | -65 |
| 19 | 269 | 382 | 113 |
| 20 | 383 | 329 | -54 |
| **Sample** | 18 | 18 | 18 |
| **Average** | 304.4 | 290.7 | -13.7 |
| **Std Dev** | 111.9 | 123.1 | 80.3 |
| **Skew** | 1.14 | 1.22 | 0.71 |

Two of your users were “No Shows” who didn’t arrive for the usability test, so you have 18 rather than 20 users. With a within-subjects test, we’re only interested in the statistics from the difference of the completion times (in seconds) for each user. In this example, we’ll arbitrarily make each user’s difference be the Flash design completion time minus the HTML completion time, so a negative number means that a user was faster with Flash and a positive number means a user was faster with HTML (it won’t make a difference if it were HTML minus Flash as long as you keep track of what positive and negative mean).

With a negative average of the difference scores, we see that on average users are faster with Flash, but only by 13.7 seconds. With within-subjects testing, the standard error is still derived from a sample standard deviation, but purely from the standard deviation of the difference scores. The formula for within-subjects testing is:

*se* = *sd* / SQRT(*n*)

Where *sd* is the standard deviation of the differences and *n* is the number of users (or number of difference scores). With the data above:

*se* = 80.3 / SQRT(18) = 18.94
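If you’d rather check that arithmetic in code than in a spreadsheet, here’s a minimal Python sketch (Python is just for illustration -the completion times are the raw data from the table above):

```python
import math

# Completion times in seconds, one pair per user (from the table above)
html  = [209, 272, 158, 392, 392, 170, 289, 479, 258, 600,
         245, 265, 182, 296, 318, 303, 269, 383]
flash = [280, 209, 155, 305, 448, 305, 251, 600, 140, 502,
         247, 196, 177, 239, 231, 238, 382, 329]

# Within-subjects: work with each user's difference, Flash minus HTML
diff = [f - h for f, h in zip(flash, html)]
n = len(diff)

mean_diff = sum(diff) / n
# Sample standard deviation of the differences (n - 1 denominator, like Excel's STDEV)
sd = math.sqrt(sum((d - mean_diff) ** 2 for d in diff) / (n - 1))
se = sd / math.sqrt(n)  # standard error for the within-subjects t-test

print(round(mean_diff, 1), round(sd, 1), round(se, 2))  # prints -13.7 80.3 18.94
```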

Without even doing the t-tests, we have a sense of where things are going: the average difference of -13.7 seconds is well within one standard error of 0 (our Hypothetical State A), so our data is consistent with no difference between Flash and HTML. We should expect it to be pretty plausible that we’d get -13.7 seconds assuming Hypothetical State A is true. But we’ll still do the math.

As always, we also take a look at the distribution of the sample. It’s more-or-less bell-shaped, but it’s lopsided. Looking at the completion times for either HTML or Flash, we see most numbers are bunched up at the relatively low end while there are relatively few at the high end. Most of your numbers are below average, which sounds like Lake Wobegon in some dark parallel universe, but it’s perfectly possible mathematically, and happens to be true in this case. The difference scores are not so badly lopsided, but I wouldn’t trust the distribution you get from a sample of difference scores.

This is called *skewed* data, and it’s common for the times to complete an action. It makes sense: there’s a limit to how fast a user can go -they certainly can’t complete a task in less than zero seconds, but there is no limit to how slow they can go. The effect on the differences in completion times, which is what we really care about, is not entirely predictable. Data may end up lopsided one way or the other, or not at all.

You can get a sense of skewness by just eyeballing how the data is distributed around the average, but there are also formulas that quantify skewness, which can be easier, especially with large data sets. Using Excel’s SKEW() function, our data has a skewness of 1.11 overall, which is pretty high. A zero would be a perfectly symmetrical distribution, and negative skew would mean skewness in the opposite direction, with sparse data stretched out below the average. Generally, skewness between -1 and 1 is mild.
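Excel’s SKEW() computes the adjusted Fisher-Pearson skewness. Here’s that same formula as a Python sketch (for illustration only; the times are from the table above, and the per-version results should land near the table’s 1.14 and 1.22):

```python
import math

def skew(xs):
    # Adjusted Fisher-Pearson skewness, the formula behind Excel's SKEW()
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample std dev
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

html  = [209, 272, 158, 392, 392, 170, 289, 479, 258, 600,
         245, 265, 182, 296, 318, 303, 269, 383]
flash = [280, 209, 155, 305, 448, 305, 251, 600, 140, 502,
         247, 196, 177, 239, 231, 238, 382, 329]

print(round(skew(html), 2), round(skew(flash), 2))
```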

But that’s not a big deal, is it? After all, the t-test only requires that the sampling distribution be normal, not the data distribution. The statistical high priests tell us that the sampling distribution will tend to be normal. Specifically, it will be close enough to normal *if the sample sizes are large enough*.

We’ve a pretty decent sample size -eighteen -so is that large enough?

Well, it depends on the shape of the distribution of data. In particular, skewness is the killer, where the greater the skewness in the data, the larger the sample has to be, blah, blah, but the short answer is no.

*No?*

No. Seriously. Don’t blindly do a t-test on completion times or reaction times from a usability test. The data is generally too skewed and the samples are generally too small to get an accurate answer from a t-test. You want more like 30 data points per design before you can stop worrying about the skewness commonly seen in completion times.

But I wasn’t just wasting your time. You *can* do a t-test on completion times or reaction times, but you must first *transform* the times. Yes, Igor! Transform my data! BWAHAHAHAHAHA!

Okay, what the hell is transforming data? It simply means mathematically manipulating each data point in the same way. In our case, we want to manipulate the data so that the distribution has little or no skewness, which is what’s keeping us from doing a t-test.

The transformation that usually works for reaction and completion times is to take the logarithm of each time. Base 10 or natural logarithm, it doesn’t matter, but, out of habit, I use the natural logarithm (the LN() function in Excel). Here’s our natural logarithm transformed data:

| User | HTML (sec) | Flash (sec) | HTML LN(Time) | Flash LN(Time) | Difference |
| --- | --- | --- | --- | --- | --- |
| 1 | 209 | 280 | 5.34 | 5.63 | 0.29 |
| 2 | 272 | 209 | 5.61 | 5.34 | -0.26 |
| 3 | 158 | 155 | 5.06 | 5.04 | -0.02 |
| 4 | 392 | 305 | 5.97 | 5.72 | -0.25 |
| 5 | 392 | 448 | 5.97 | 6.10 | 0.13 |
| 7 | 170 | 305 | 5.14 | 5.72 | 0.58 |
| 8 | 289 | 251 | 5.67 | 5.53 | -0.14 |
| 9 | 479 | 600 | 6.17 | 6.40 | 0.23 |
| 10 | 258 | 140 | 5.55 | 4.94 | -0.61 |
| 11 | 600 | 502 | 6.40 | 6.22 | -0.18 |
| 12 | 245 | 247 | 5.50 | 5.51 | 0.01 |
| 13 | 265 | 196 | 5.58 | 5.28 | -0.30 |
| 14 | 182 | 177 | 5.20 | 5.18 | -0.03 |
| 15 | 296 | 239 | 5.69 | 5.48 | -0.21 |
| 16 | 318 | 231 | 5.76 | 5.44 | -0.32 |
| 18 | 303 | 238 | 5.71 | 5.47 | -0.24 |
| 19 | 269 | 382 | 5.59 | 5.95 | 0.35 |
| 20 | 383 | 329 | 5.95 | 5.80 | -0.15 |
| **Sample** | 18 | 18 | 18 | 18 | 18 |
| **Average** | 304.4 | 290.7 | 5.66 | 5.60 | -0.063 |
| **Std Dev** | 111.9 | 123.1 | 0.35 | 0.39 | 0.290 |
| **Skew** | 1.14 | 1.22 | 0.20 | 0.39 | 0.532 |

The log transformation pulls in the right-hand tail of a distribution, making positively skewed distributions more symmetrical. Note the skewness is pretty much gone, both from the times for each version and from the difference in the times. We have nearly equal numbers of data points on each side of the mean. The average difference is now -0.063 instead of -13.7, but of course it’s still showing the Flash version is faster than the HTML version. And now we can do a t-test and figure the probability of getting a difference of -0.063 with a sample of eighteen users.
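The transformation itself is mechanical: one LN() per data point, then the per-user difference. A Python sketch of that step (illustration only; data from the table above):

```python
import math

html  = [209, 272, 158, 392, 392, 170, 289, 479, 258, 600,
         245, 265, 182, 296, 318, 303, 269, 383]
flash = [280, 209, 155, 305, 448, 305, 251, 600, 140, 502,
         247, 196, 177, 239, 231, 238, 382, 329]

# Log-transform each time, then take each user's difference, Flash minus HTML.
# Note LN(flash) - LN(html) is the same as LN(flash/html).
ln_diff = [math.log(f) - math.log(h) for f, h in zip(flash, html)]

n = len(ln_diff)
mean = sum(ln_diff) / n   # about -0.063
sd = math.sqrt(sum((d - mean) ** 2 for d in ln_diff) / (n - 1))  # about 0.29
print(round(mean, 3), round(sd, 2))
```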

There are no hard and fast rules for when to transform or not. Completion times typically are skewed, but don’t have to be. Skewness tends to be low for long, multi-step routine tasks with trained users, as long as none leave the computer to catch some waves in the middle of the task. Furthermore, what we really care about is the skewness of the differences in the times, which may or may not be skewed even when the source times are skewed. However, the skewness you see is itself based on a small sample and can be off by some amount from the true skewness in the population. Skewness can go all over the place for the differences in completion times, and what you see in the sample is not always the best indication of what’s in the population. I wouldn’t trust it.

As a rule of thumb for sample sizes typical in usability tests (i.e., 10 or fewer data points per design), you should generally assume completion times are skewed unless:

- There is very little positive skewness in the sample (say under 0.50), or,
- You have prior data or other information that indicates there is no skew.

For intermediate sample sizes of 10 to 30, I would suggest focusing on the skewness of the completion times, not the differences of the completion times. If the raw completion times are (close enough to) normal, then the differences in the completion times will be normal too. If doing the transformation reduces the skewness (makes it closer to zero) then it’s probably worth doing.

For over 30, you’re probably okay without any transformation of completion time data (where skewness is generally less than 2). Let the Central Limit Theorem do its job.

The main reason not to do a transformation is that it changes a bit what you’re testing. Generally, the average of the transformed data is not equal to the transformation of the average of the data. In our example, the average of the log-transformed differences is -0.063, while the difference of the logs of the averages, LN(290.7) – LN(304.4), is -0.046. A t-test on transformed data means you are no longer comparing arithmetic averages. Rather, you are comparing some other indication of the central tendency. That can make it harder to explain your results to your client.

Fortunately, in the case of the log transformation, you are effectively comparing the *geometric* averages. While the arithmetic average is all the numbers added together and divided by *n*, the geometric average is all the numbers *multiplied* together taken to the *n*th root. Sort of an average worthy of a multi-gigahertz processor. This is fortunate because, for completion times, the geometric average is generally a better indication of the central tendency than the arithmetic average.

To get the geometric average from our averages of the transformed data, reverse the transformation of the averages:

EXP(average of LN-transformed data) = geometric average

EXP(5.66) = 287 seconds = geometric average of HTML

EXP(5.60) = 270 seconds = geometric average of Flash
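As a quick check in code, here’s a Python sketch of the same round trip -times to logs to geometric average (illustration only; the data is the HTML times from the table, and the result should land near the 287 seconds above):

```python
import math

html = [209, 272, 158, 392, 392, 170, 289, 479, 258, 600,
        245, 265, 182, 296, 318, 303, 269, 383]

# Geometric average: exponentiate the average of the logs.
# This is equivalent to multiplying all n times together and taking the n-th root.
geo_avg = math.exp(sum(math.log(t) for t in html) / len(html))

# Compare with the ordinary arithmetic average
arith_avg = sum(html) / len(html)
print(round(geo_avg), round(arith_avg, 1))
```

Note the geometric average comes out lower than the arithmetic average, as it typically does for positively skewed data -the long right tail drags the arithmetic average up more than the geometric one.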

Rating scales, like I said, tend not to be skewed much, so transformations are usually not necessary. However, you can get skewness -indeed you should expect skewness -if you are getting any floor or ceiling effects, where scores tend to be bunched up at one or the other extreme of the scale. For example, if our rating scale from Stat 201 averaged about 30 for its possible range of 5 to 35, meaning on average users were giving 6’s on the 7-point items, I would expect skewness. It’s as if users were pushed up against the ceiling of 35 points. Some apparently felt that “Agree” and “Strongly Agree” were not sufficiently strong for their level of agreement. They needed “Totally Agree,” “Massively Agree” and “F-ing Epicly Agree” beyond it. If you get skewed ratings due to floor or ceiling effects, you can try the following transformation to reduce skewness:

- Convert the user’s scores to the proportion of total possible range of points, so that 0 corresponds to the lowest possible score and 1 corresponds to the highest possible score. For example, with our scale that goes from 5 to 35 points, a 30-point score becomes (30 – 5)/(35 – 5) or 0.833.
- Take the arc sine (in Excel the ASIN() function) of the proportion. So a proportion of 0.833 becomes 0.985 radians (Excel by default gives arc sines in radians; it doesn’t matter if you use radians or degrees since they’re arbitrary units for a rating)

The above transformation has the effect of reducing floor and ceiling effects, stretching what otherwise would be compressed differences at the extreme ends of the scale.
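The two steps above can be sketched in Python (illustration only, using the 5-to-35 scale and the 30-point score from the example):

```python
import math

low, high = 5, 35   # possible range of the rating scale from the example

def arcsine_transform(score):
    # Step 1: rescale the score to a 0-to-1 proportion of the possible range
    p = (score - low) / (high - low)
    # Step 2: take the arc sine of the proportion (radians, like Excel's ASIN)
    return math.asin(p)

print(round(arcsine_transform(30), 3))  # prints 0.985
```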

The t-test follows the same steps as we saw in Stat 201:

**Step 1. Calculate your standard error**. For the transformed data, it’s:

*se* = 0.290 / SQRT(18) = 0.0683

**Step 2. Calculate your sample t-statistic**, which is still the deviation of your observation from the hypothetical state in units of the standard error. We’re dealing with transformed data now, but that doesn’t change Hypothetical State A. If the real difference in the un-transformed data is zero, then the difference in the log-transformed data will also be zero, assuming the data for each design have the same distribution as well as the same average. So we’ll still calculate the probability of seeing the -0.063 difference in the transformed data assuming the real difference is 0.

So the sample t-statistic is:

*tstat* = (*diff* – *hypo*) / *se*

*tstat* = (-0.063 – 0) / 0.0683 = -0.92

We draw our sampling distribution, as always, putting most of the distribution within a standard error of the average:

**Step 3 through 3 and a half. Get the p-value**, working your way around Excel. For a within-subjects t-test -or any t-test on a single column of numbers like we’re doing here -the degrees of freedom is your sample size minus 1:

*df* = *n* – 1

*df* = 18 – 1 = 17

Since we’re doing a two-tailed test, always enter the positive tail into Excel so it won’t trip out and give you an error, and let Excel double it for you to include the other tail.

*p* = TDIST(ABS(-0.92), 17, 2) = 0.373

No surprise here: it looks like it’s reasonably plausible that we’d see a difference as extreme as 0.063 in a sample size of eighteen when there is in fact no difference in the population. That’s just as we suspected when we calculated the standard error for the untransformed data. Properly transforming is not going to magically insert more certainty into your data; it’s just going to shift the data around as a group to make the statistical test work more accurately.
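The whole Hypothetical State A calculation can be sketched end-to-end in Python (illustration only; the standard library has no t-distribution function, so the sketch stops at the t-statistic and degrees of freedom and leaves the p-value to Excel’s TDIST, as above):

```python
import math

html  = [209, 272, 158, 392, 392, 170, 289, 479, 258, 600,
         245, 265, 182, 296, 318, 303, 269, 383]
flash = [280, 209, 155, 305, 448, 305, 251, 600, 140, 502,
         247, 196, 177, 239, 231, 238, 382, 329]

# Log-transform, then take each user's difference (within-subjects)
ln_diff = [math.log(f) - math.log(h) for f, h in zip(flash, html)]
n = len(ln_diff)

mean = sum(ln_diff) / n
sd = math.sqrt(sum((d - mean) ** 2 for d in ln_diff) / (n - 1))
se = sd / math.sqrt(n)

hypo_a = 0.0                    # Hypothetical State A: no real difference
tstat = (mean - hypo_a) / se    # about -0.92
df = n - 1                      # 17
# In Excel: =TDIST(ABS(tstat), df, 2), which gives about 0.37
```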

At this stage, however, we don’t know the likelihood of seeing this difference when there is in fact a considerable difference in the population. Time to test Hypothetical State B.

**Step 1. Calculate your standard error**. That doesn’t change when we go to Hypothetical State B:

*se *= 0.0683

**Step 2. Calculate your observed t-statistic**. Now we have to be careful. For untransformed data, Hypothetical State B was that the faster design takes 80% of the time as the slower design; in other words, the following ratio is true:

*faster*/*slower* = 0.80

However, taking the log transform changes this equation because the ratio of two numbers is not equal to the ratio of the logarithms of the same two numbers:

*faster*/*slower* <> Ln(*faster*)/Ln(*slower*)

So it would be incorrect to get the probability of seeing a difference of -0.063 assuming the ratio of the transformed data is 0.80. We need to convert Hypothetical State B into its equivalent with log-transformed data. To do that, we take advantage of the following mathematical relation:

Ln(*x*) – Ln(*y*) = Ln(*x*/*y*)

Which means for Hypothetical State B to be true:

Ln(*faster*) – Ln(*slower*) = Ln(*faster*/*slower*) = Ln(0.80) = -0.223

In other words, Hypothetical State B states that the difference in the log-transformed completion times is -0.223. We’re doing a two tailed test, so we want to know the probability of getting a difference as extreme as 0.063 (positive or negative) given a real difference of 0.223 (positive or negative respectively).

So our t-statistic for Hypothetical State B is:

*tstat* = (-0.063 – (-0.223)) / 0.0683 = 2.35

Be sure the sign of your hypothetical state (positive or negative) matches the sign of your observed difference to allow for the fact that this is a two-tailed test.

Here’s the sampling distribution:

**Step 3 through 3 and a half. Get the p-value**.

Just another two-tailed test; the degrees of freedom haven’t changed:

*p* = TDIST(ABS(2.35), 17, 2) = 0.031

So it is a pretty safe bet that there is no considerable difference between the two designs -no good reason to believe one is definitely worth favoring over the other.
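The Hypothetical State B version of the calculation differs only in the hypothesized difference, which must itself be in log units. A Python sketch (illustration only; same data as before):

```python
import math

html  = [209, 272, 158, 392, 392, 170, 289, 479, 258, 600,
         245, 265, 182, 296, 318, 303, 269, 383]
flash = [280, 209, 155, 305, 448, 305, 251, 600, 140, 502,
         247, 196, 177, 239, 231, 238, 382, 329]

ln_diff = [math.log(f) - math.log(h) for f, h in zip(flash, html)]
n = len(ln_diff)
mean = sum(ln_diff) / n
sd = math.sqrt(sum((d - mean) ** 2 for d in ln_diff) / (n - 1))
se = sd / math.sqrt(n)

# Hypothetical State B: the faster design takes 80% of the slower design's time.
# Converted to log units: LN(0.80), with the sign matching the observed difference.
hypo_b = math.log(0.80)          # about -0.223
tstat = (mean - hypo_b) / se     # about 2.35
# In Excel: =TDIST(ABS(tstat), n - 1, 2), which gives about 0.03
```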

The good news is that your efforts to minimize the standard error appear to have paid off. With a single test of 18 users you can assert with high confidence that neither design is considerably faster to use than the other. The bad news is that you still have to choose what goes in the final web site, and the completion times are not going to help you much in this regard. Collecting more completion time data is not going to help here. Unless you’ve been extraordinarily unlucky, neither Paint Your Board design takes less than 80% of the time of the other. Running more tests is just going to help you narrow down the range -maybe one takes less than 90% or 95% of the time, which is all in the “whatever” zone, so why bother?

You need to make a design decision and if completion times won’t help, you need something else to tell if one is even *slightly* better than the other. There are several ways you and Stan can proceed.

You may be tempted to put both designs up on the web site and make it a user option. After all, you went through the trouble of prototyping both -you’d hate to throw one away. Looking through your data, you see some users were faster with HTML and some were faster with Flash. Maybe it’s a matter of user personality or experience.

This is rarely a good design strategy. Essentially, you’re saying, “I, a UX professional, can’t tell which is better, despite rigorous scientific testing, so I’m going to make the user, who doesn’t know Flash from Bash, make the decision.” First of all, just because a user was faster with, say, Flash, in a particular usability test doesn’t mean that user is *generally* faster with Flash. The faster performance may be due to various situational factors such as fatigue, learning effects, boredom, distraction, or whatever. You would need to compare each user on multiple Flash and HTML implementations to tell if there are indeed natural-Flash-users and natural-HTML-users.

The more serious problem is this: how are users supposed to know if they are Flash or HTML users, if such creatures exist? What do you put on the website to help them decide? You could provide access to the Flash version through an “Advanced” link, which is a common solution to this sort of problem, but users won’t know you mean “advanced” in a particular Flashy way. They are liable to think “Advanced” gives them more Paint Your Board options and control, like Advanced Search. It might work to label the Flash version “Paint Your Board with Drag and Drop” if users who know how to drag and drop also know what “drag and drop” means. I wouldn’t assume so. Whatever label you use, you’ll have to test it to see if users with a choice paint their boards faster than users without a choice. And remember: the very fact that you give the users a choice means it’ll take more time -time to make the decision and time to correct themselves if they choose the wrong one at first. Adding choice means adding complexity. You should do it when it solves a well-understood user problem, not a poorly understood designer problem.

Perhaps the most educational thing you can do is study your quantitative and qualitative data to try to figure out what happened. You expected Flash to have faster completion times but be harder to figure out in the first place. But how exactly did that play out? What part of the Flash instructions took too long to read or were too hard to understand? Maybe there are some simple improvements you can make that will put the Flash version on top.

Review your debriefing comments from your users. Maybe certain users do better with Flash while others do better with HTML because of specific design elements in each. Maybe there is a way to combine the best of HTML with Flash to make a superior design for all users. For example, maybe you can add input fields to the Flash design for those who do well with numeric input, but apply those inputs instantly, rather than through a Submit button, in order to reduce the time-consuming cut-and-try of the HTML design.

Unfortunately, in this case, both designs have already been through a set of iterations and testing since their first paper prototypes, and there isn’t time left to further tweak the design. Any further improvements will have to wait until Version 2.

Take a second look at the limitations of your usability test. All usability tests are to some degree a simulation of actual user behavior. If you consider the differences that likely exist between the test conditions and real conditions, maybe you can see a way to break the tie between the two designs. In tightly controlling the task in order to maximize decision potential, you also limited the range of conditions under which users used the feature. What may be the same performance in your usability test may not be the same on average when you generalize the results to the wider range of conditions the users will encounter.

In this example, all your users were unfamiliar with both designs for the Paint Your Board feature, which is pretty typical of usability tests. However, it is reasonable to expect that experienced users will do especially well with the Flash version, since its main issue was learning it in the first place. A study of the qualitative data may find evidence to support this: the videos show users of Flash first struggling with drag and drop, but once they know what to do, they really take off. Breaking the task into two subtasks and running a couple more t-tests may confirm this.

If Flash is essentially tied with HTML for novice users but it looks like Flash will be faster for experienced users, then on average Flash is better for all users. Perhaps that’s enough to tip the balance. However, Stan tells you that he expects very few experienced users. Even surfers who keep multiple boards in their respective quivers are rarely going to order a custom board -maybe once a year or less for the serious mega-ripper. You don’t know if users will remember what they learned a year ago. It’s not a very strong argument for Flash.

You can look at other data to help break the tie. For example, if you had thought to also include rating scales in this usability test, maybe that would tell you which design the users preferred regardless of the completion times. Right. *That* would’ve been a good idea. Okay, but you have other data -serendipitous data you can extract from the videos, key-logging, and user outputs. For example, in Stan’s opinion the boards resulting from the Flash versions were more creative. Ten users painted their boards exactly the same with each design of Paint Your Board, but eight users painted their boards differently as they went from Flash to HTML or vice versa. Stan says that of the eight users that painted their boards differently, all but one user made a more creative board with Flash.

Hmm. If there were in fact no effect of the Paint Your Board feature design on creativity, then there would be a 50:50 chance of Flash boards being more creative. You turn to the binomial probabilities tables from Stat 101 and look down the 50% population rate column for a sample size of eight. It seems that the probability of 7 or more users getting a better board with Flash is 0.035 given a Hypothetical State of 50:50. We’re still in a two-tailed situation here, so double that probability to get 0.070 (congratulations: you just did a sign test). Statistically, it seems very likely that Flash does indeed result in more creative boards, at least in Stan’s opinion. Maybe users weren’t taking less time with the Flash version, but instead they were using the same amount of time to play with the Flash version to get something extra-rad. Maybe with the HTML version they were merely settling on what they could get in a comparable amount of time.
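The binomial arithmetic behind that sign test is small enough to do directly, without tables. A Python sketch (illustration only; 7 of 8 users, 50:50 hypothetical state):

```python
from math import comb

n, k = 8, 7   # 8 users painted the boards differently; 7 were more creative with Flash

# P(k or more "Flash better" outcomes out of n) when each outcome is a 50:50 coin flip
p_one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Two-tailed sign test: double it, since a lopsided result in either
# direction would have counted as evidence
p_two_tail = 2 * p_one_tail

print(round(p_one_tail, 3), round(p_two_tail, 3))  # prints 0.035 0.07
```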

Sounds reasonable, and maybe that’s enough to favor Flash over HTML. Of course, it’s just based on Stan’s opinion, and maybe Stan was biased, subconsciously rating the Flash boards better because he feels Flash is so swick. To really do this right, you should bring in a bunch of other surfers and have them compare the boards without knowing whether the boards were painted with Flash or HTML. You’d have them rate the relative creativity on a scale so you’re capturing the magnitude of the difference for each user, not just the direction of the difference. This provides you with more information, yielding greater decision potential. You’d analyze the data with a t-test on the average rating, with Hypothetical State A set to whatever rating means “no difference.” Yeah, that would be the right thing to do. Too bad there’s no time for more testing.

If, after considering all the data, there still does not appear to be a clear winner, consider each design’s impact on UX dimensions other than usability. The HTML version provides better accessibility for all those blind surfers out there (there are some) who also want to impress the sweet wahines (or kanes) with their sweet paint job. However, you could argue that just providing mechanical access to the Paint Your Board controls doesn’t really provide adequate accessibility anyway, since the feedback is inherently visual. It seems the Flash version is cooler. Stan is certainly personally thrilled by it: this is precisely the sort of heavy technology he was looking for to distinguish his site’s UX from the competition. It’s perf that something as advanced as his CNC surf board painter would have a coolaphonic UI. If the usability is really equal, that’s as good a reason as any to go with Flash.

If the Flash and HTML version are really tied on completion time, maybe you should flip a coin to decide which design to go with.

No, don’t do that. The truth is they aren’t tied. Flash performed better in the sample. With 1 - 0.373 = 62.7% certainty you can say that Flash is better than HTML. That’s not a whole lot of confidence when a coin flip gives you 50% certainty of being right, but it’s nothing to ignore either. While it’s plausible that both designs have equal completion times, it is more likely that Flash is better than HTML than vice versa. If all other considerations are equal, and you *have* to choose one design (and in fact you do), you should go with the one that performed better in the sample. That may not be a great bet, but it’s better than the 50:50 bet a coin toss would get you.

Of course, going with the lowest average completion time is exactly what you would’ve done if you didn’t bother with the t-test or any inferential statistics, so you may be wondering what the point of it all was. The point was to make an informed decision. There’s a difference between being statistically mushed, where you are *uncertain* if one design is considerably better than the other, and this situation, where you are *confident* that there is no considerable difference between the designs on completion time.

So go ahead with Flash, but go ahead knowing that there really isn’t much difference in completion time worth worrying about. Go ahead, recognizing you may want to revisit the feature later and see if you can improve it. Go ahead when your decision is bolstered by consideration of other business goals, other data, and/or awareness of the limitations of the usability test. In this case, no consideration or statistic by itself is a strong argument for the Flash design, but as you discuss the issues with Stan, it’s clear that in aggregate, going with Flash makes the most sense. After all, a single usability test statistic, however definitive, is just one element of everything that goes into a design decision.

There’s another reason to do the inferential statistics. The t-test for Hypothetical State B led you to conclude that Flash is not considerably better than HTML on completion time. However, remember it was a two-tailed test: you’re justified in also concluding that Flash is not considerably *worse* than HTML. So it was definitely worth doing the t-tests to give you and Stan some peace of mind. With high certainty, going with Flash won’t be a tragic decision.

**Solution**: Follow the flow chart below.

- Determine if a one- or two-tailed test is right for you.
- With your client, set your Hypothetical States A and B.
- Check your data distribution for skewness, transforming it if necessary.
- Conduct a t-test for Hypothetical State A:
  - Calculate your standard error.
  - Calculate the t-statistic.
  - Determine the p-value with something like Excel’s TDIST() function.
- If the p-value is sufficiently low to represent a tolerable Type I error rate, proceed with the new design.
- If the p-value represents an excessive Type I error rate, conduct a t-test for Hypothetical State B.
- If the p-value represents a tolerable Type II error rate, do not proceed with the new design.
- If the p-value represents an excessive Type II error rate, calculate the increase in sample size you need to get a low p-value for either Hypothetical State A or B.
- Increase your sample size.
- Repeat.
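Those branches are simple enough to capture in a few lines. Here’s a minimal sketch of the flow chart’s decision logic in Python; the p-values and the 0.10 cutoff in the examples are made up for illustration:

```python
def decide(p_a, p_b, alpha=0.10):
    """Decision logic from the flow chart. p_a and p_b are the p-values
    from the t-tests for Hypothetical States A and B; alpha is the
    client's tolerable chance of a Type I or Type II error."""
    if p_a < alpha:
        return "proceed with the new design"          # State A is implausible
    if p_b < alpha:
        return "do not proceed with the new design"   # State B is implausible
    return "increase your sample size and repeat"     # statistical mush

# Hypothetical p-values for illustration:
print(decide(0.04, 0.50))  # proceed with the new design
print(decide(0.45, 0.06))  # do not proceed with the new design
print(decide(0.45, 0.30))  # increase your sample size and repeat
```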

Here’s an Excel sheet with this post’s data and analysis.

The second in a series on inferential statistics, covering:

- Improving on an existing UI design.
- Estimating needed sample sizes.
- The normality assumption.
- Between-subject testing.
- One-tailed t-tests.

**Prerequisites:** Stat 101, a college-level intro stat course or equivalent sometime in your life, surfer lingo.

Here’s the scenario: Surfer Stan’s ecommerce site hasn’t been updated since single-fin boards made a comeback. Stan contacted you because he heard you had gnar-gnar technology to make one sick user experience. That means really good, he assures you. But to get the full contract, you had to prove it. You agreed to build an interactive prototype of the home page and test it out head-to-head against the existing site. You were looking to improve both the aesthetics and the clarity of the design. You thoroughly studied your users and the surfing domain, and got a good handle on not only how your users see the task, but also the values and culture of surfers. Users seemed really thrilled by the early prototypes. They were particularly impressed with how the pulldown menus curl out like breaking waves. You were expecting users to rate it much better than the inconsistent and unharmonious old version. However, on the big day when the fully functioning new version took on the old, here’s what you found:

**Data**

| Old Home Page User | Rating | New Home Page User | Rating |
| --- | --- | --- | --- |
| 1 | 23 | 2 | 26 |
| 3 | 20 | 4 | 29 |
| 5 | 31 | 6 | 21 |
| 7 | 16 | 8 | 13 |

**Statistics**

|  | Old Home Page | New Home Page | Both Pages |
| --- | --- | --- | --- |
| Sample | 4 | 4 | 8 |
| Average | 22.50 | 22.25 | 22.38 |
| Std Dev | 6.35 | 6.99 | 6.19 |

Difference in averages (new minus old): -0.25

Look at the statistics. You remember standard deviation (Std Dev), right? It’s roughly how far on average the numbers are from their average. In this case, most of the eight ratings are within about 6 points of the average 22.38. We use the “unbiased estimate of the population standard deviation,” the one with the weird (n – 1) in the denominator of its formula. It’s the STDEV() function in Excel. Also check out the distribution of the data, graphing it as a histogram or number line if you like. It’s got a reasonable bell-shape to it, with most scores clustered in the middle around 22, and few at the extremes. Could be normally distributed, which would be nice, but, as we’ll see, it doesn’t have to be for statistical analysis.
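If you’d rather check those statistics in code than in Excel, here’s a sketch in Python; `statistics.stdev` uses the same unbiased (n – 1) formula as Excel’s STDEV():

```python
from statistics import mean, stdev  # stdev = unbiased estimate, (n - 1) denominator

old = [23, 20, 31, 16]  # ratings for the old home page (users 1, 3, 5, 7)
new = [26, 29, 21, 13]  # ratings for the new home page (users 2, 4, 6, 8)

print(f"old:  avg {mean(old):.2f}, sd {stdev(old):.2f}")              # avg 22.50, sd 6.35
print(f"new:  avg {mean(new):.2f}, sd {stdev(new):.2f}")              # avg 22.25, sd 6.99
print(f"both: avg {mean(old + new):.2f}, sd {stdev(old + new):.2f}")  # avg 22.38, sd 6.19
print(f"difference in averages: {mean(new) - mean(old):.2f}")         # -0.25
```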

Something else should jump out at you: the averages. The new design scored *lower* than the old design. The difference in the averages (new average minus old average) is -0.25. That’s a negative statistic in more than one sense. What happened? Nothing wrong with your methodology. You tested a total of eight users, assigning four at random to try each version. That’s not an atypical number for a usability test. But can you trust the results? With only four users assigned at random to each version, maybe you happened to put one or two negative nutters on the new version. I mean, check out the dawg who gave the new design a 13. He’s way out there. Is there a reasonable chance you just had bad luck? Put it this way: what is the probability of seeing a -0.25 point difference by chance when eight people randomly use one or the other version?

Here’s where we return to what we learned in Stat 101. When Stan asks you, “what’s the probability, bubba?” you ask back, “the probability *given what*, duder?” You and Stan need to decide on Hypothetical States A and B.

Hypothetical State A: What true improvement in ratings is definitely *not* worth the new design?

Hypothetical State B: What true improvement in ratings *is* definitely worth the new design?

In some cases, any improvement at all is worth at least considering. That is, Hypothetical State A should be a difference in the ratings of zero, so that any difference greater than zero counts as an improvement (using the arbitrary convention of subtracting the old rating from the new rating). That’s probably what you were taught to do in stat class, and it makes sense for major sites with a lot of users where even a minuscule improvement translates into big total gains that will pay for the cost of redesign in no time. It can also make sense for niche sites with few users if the site is going to be redesigned anyway (or created in the first place), and the client wants some indication that *your* redesign in particular is going to help. However, Stan has a very small operation and he already has a functioning web site that’s doing okay. To justify the expense of building a new one, you have to do better than just *any* improvement.

There’re several ways to arrive at your hypothetical states.

**Equivalent Cost**. If you can somehow estimate the increase in revenue for each incremental improvement to the site, then you can figure how much improvement you need to pay for the site redesign in a reasonable amount of time. For example, if each point of improvement on the satisfaction scale is correlated with 2 more conversions per month totaling $500 in revenue, then to pay off a $36,000 site redesign in two years, you need at least a 3.0 point improvement. That would definitely make the redesign worth doing (Hypothetical State B). The improvement that’s definitely *not* worth redesigning (Hypothetical State A) would be somewhat less than 3 points, when you consider that immediate conversion in a single session isn’t the only consideration: brand loyalty and word-of-mouth have value too. Generally, however, you’re not going to be able to translate scale scores into conversions: you won’t have the data. Stan certainly doesn’t have that data.
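The payback arithmetic is simple enough to sanity-check in a couple of lines (using the made-up figures from the example above):

```python
revenue_per_point_per_month = 500.0  # $: 2 extra conversions/month per scale point
redesign_cost = 36_000.0             # $: cost of the site redesign
payback_months = 24                  # pay it off in two years

# Points of scale improvement needed to cover the redesign cost
points_needed = redesign_cost / (revenue_per_point_per_month * payback_months)
print(points_needed)  # 3.0
```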

**Percentile Improvement**. If you’re using a standardized scale, such as SUS, or a scale for which you have data on a lot of sites, then you can shoot to improve the percentile ranking of the site. For example, you and Stan may agree that improving the site 10 percentage points is definitely worth doing while improving it only 2 percentage points is definitely not worth doing. However, in this case, you’re using a scale tailored specifically for surfers (real surfers, not web surfers) to measure specifically the kind of experience that Stan wants to achieve, so you have no data on percentiles.

**Scale Details**. You and Surfer Stan take a third approach, which is to look at the scale items and judge what change would qualify as a sufficient improvement. For Hypothetical State A, you select the “who cares?” level of improvement, while for Hypothetical State B, you select the “effing yeah!” level of improvement. In this case, let’s say each user’s rating is the sum of five 7-point Likert scales for items like “The home page is totally righteous” (1 – strongly disagree to 7 – strongly agree), so the possible lowest score is 5 (all ones) and the highest is 35 (all sevens). From this, Stan reasons that if the new design on average only improves one item by one point (e.g., moves “totally righteous” from 4, neutral, to 5, “somewhat agree,”), then it’s definitely not worth it. In other words:

Hypothetical State A: The difference in the averages is 1.00 points.

Alright.

Of course, right now the difference in averages is *negative* 0.25 points: your new design performed *worse* than the old design, so it’s looking like an uphill battle.

Following the same line of reasoning, you and Stan agree that if on average most items (3 out of 5) improve by one point on the scale, then it’s definitely worth going for the new design. So,

Hypothetical State B: The difference in the averages is 3.00 points.

This makes a “whatever zone” of 1.00 to 3.00: a range of population values where it doesn’t make much difference to Stan whether he stays with the old design or proceeds with the new design. For some reason, I think Surfer Stan would appreciate the “whatever zone.”

If your stat class is coming back to you, you probably recognize Hypothetical State A as corresponding to the “Null Hypothesis” (H_{0}), which is correct. You may also recognize Hypothetical State B as the “Alternative Hypothesis,” which is completely wrong. For one thing, A and B are not exhaustive: it’s possible for both to be wrong about the population. Most intro stat classes, and most advanced classes for that matter, don’t cover Hypothetical State B or its equivalent. Setting and testing Hypothetical State B is a procedure I made up to ensure there’s adequate statistical power in the analysis, which you probably didn’t worry too much about in your intro stat class. Such classes are geared towards applications in science where Type I errors are much worse than Type II errors. It’s true that all scientists try to get as much power as they can (by that I mean statistical power, not evil scientist take-over-the-world power). However, rarely do scientists try to quantify their power. But we’re doing usability testing where a Type II error is as bad as a Type I, so we’re going to be setting and testing Hypothetical State B in addition to A.

At this stage, it may be a good time for you and Stan to discuss how much risk of error he’s comfortable with. Soon you’ll be looking at p-values and have to decide if it’s sufficiently low to make a go/no-go decision on whether to proceed with the re-design. So, Stan, what chance would you tolerate being wrong? What chance are you willing to take that you’re redesigning the site when you definitely shouldn’t, or not redesigning the site when you definitely should?

Scientists use a 0.05 probability: that’s their “level of statistical significance.” Stan may be a thrillseeker among major brutal waves, but he’s pretty conservative with his business. Still, a 0.05 probability strikes him as a little strict. He says he’s happy with 0.10, a 10% chance of being wrong. Dude, that’s a 90% chance of being right.

We now proceed to calculate the probability of getting a -0.25 point difference if Hypothetical State A is true. Our logic is this: If our observed difference is implausible assuming Hypothetical State A is true, then we proceed with the redesign. That is, if the probability of getting at least the observed difference is less than 0.10 for the Definitely Don’t Redesign state, then we conclude that the true difference in the population is higher: it’s at least in the Whatever Zone (where redesigning isn’t an appreciably *harmful* choice to Stan’s business), and may be at or above the Definitely Redesign threshold.

The t-test is the tried-and-true procedure for calculating the probability of an observed difference in averages. It can detect differences in samples when other statistical tests can’t. You might think data is data, but some test procedures are more efficient than others, acting like a more powerful microscope to detect smaller effects. Among all test procedures, the t-test ranks among the top.

But the t-test has a catch: it’s only accurate if the sampling distribution is *normal*, if it has that bell-shape that was probably on the cover of your stat textbook. The sampling distribution is not, contrary to its name, the distribution of your sample data. A t-test does not require that your scale scores be normally distributed. That’s good, because a normal distribution isn’t just any bell-shape. It’s a very specific bell-shape. Our data is bell-shaped, but I don’t know if it’s a normal distribution. More to the point, I don’t care, because only the *sampling distribution* has to be normal.

So what’s a sampling distribution? It’s that thing that you never quite got in stat class. One minute you’re doing histograms and other pretty pictures, and stats is so easy, then POW! Along come sampling distributions, and you’re lucky to eke out a C+. So never mind. Move along.

Okay, I’ll give it a try. A sampling distribution is this: You’ve got a difference of two averages equal to -0.25 points. You want to know the probability of seeing that -0.25 in a sample of eight users. I mean, if you did the usability test again on a different sample of eight users, you’ll almost certainly see a different difference in the averages. Maybe it would be -1.00 points. Maybe 2.50 points *in favor* of your new design. Now imagine you’re the god of statistics. You can run a usability test on *every possible sample* of eight users. Zillions of usability tests. For each one, you get the difference in the averages, zillions of differences. Graph those zillions of differences as a histogram. *That’s* a sampling distribution. A sampling distribution is the distribution of numbers (statistics) calculated from every possible sample. Each entire sample supplies *one* number to the distribution, not all of its data.

Only the gods have ever seen an actual sampling distribution, but fortunately our ancient ancestors, the high priests of statistics, discovered the central limit theorem, which proves mathematically that sampling distributions involving averages tend to be normally distributed. “Tend to be”? I admit you’d expect more certainty from a mathematical proof, but practically you can assume your sampling distribution is close enough to normal if your data distributions are even vaguely bell-shaped and symmetrical, with extreme scores on one side of the average balanced by extreme scores on the other. Rating scales tend to be bell-shaped and symmetrical, so I think we’re cool on the normal sampling distribution requirement.

If you’re still with me, you can see why the sampling distribution is so important to our problem. If we know the exact distribution of the difference in averages for every possible sample, we can tell how likely we’d get our particular difference in averages (-0.25 points). Is it right in the fat part of the sampling distribution where there are a lot of possible samples that produce about that number? Then there’s a high probability we’d see -0.25 points. Or is it off in one of the tails of the sampling distribution where there are relatively few possible samples that could produce it? Then we have a low probability we’d see -0.25 points.

The central limit theorem tells you the shape of the sampling distribution, but to get those probabilities, you need to know the size of the sampling distribution. Fortunately the high priests of statistics come to the rescue again. They discovered there is a mathematical relation between the sampling distribution and the statistics in a sample. You can estimate characteristics about the sampling distribution from your sample data. Specifically, the estimated standard deviation of the sampling distribution of the difference between two averages from two separate groups of users is:

*se* = SQRT(*s1*^2/*n1* + *s2*^2/*n2*)

*s1* and *s2* are the standard deviations from each of your groups of users, while *n1* and *n2* are the sample sizes for each group of users.

The estimated standard deviation of a sampling distribution is called the “estimated standard error,” but I’ll just call it the “standard error” for short and symbolize it with *se*. They call it the standard error because it’s the amount of sampling error you could easily expect to get from your sample.

So the estimated standard error for your rating scale data is:

*se* = SQRT(6.35^2 / 4 + 6.99^2 / 4) = 4.72

You, mere mortal, now know that if you could run every possible sample of eight users through your usability test, the standard deviation of those zillions of differences of the averages will be about 4.72. You’ve observed a -0.25 point difference between the averages, but you now know that -0.25 points could easily be off by 4.72 points from the real difference in the population. The real difference could be more negative… or it could be positive, where the new page is better than the old. That’s notorious, bro! You *are* like a god!
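You don’t actually need godhood to peek at a sampling distribution: you can approximate one by simulation. Here’s a sketch in Python (standard library only) that repeatedly draws four users per version from hypothetical normal populations with the sample’s standard deviations, and compares the spread of the resulting differences in averages to what the standard error formula predicts:

```python
import random
from statistics import mean, stdev

random.seed(202)  # Stat 202, for reproducibility
n = 4             # users per version, as in the usability test

# Hypothetical populations: ratings are normal with the observed standard
# deviations (the population means only shift the center, not the spread).
diffs = []
for _ in range(100_000):
    old = [random.gauss(22.50, 6.35) for _ in range(n)]
    new = [random.gauss(22.25, 6.99) for _ in range(n)]
    diffs.append(mean(new) - mean(old))

print(f"simulated sd of the differences: {stdev(diffs):.2f}")
print(f"standard error formula: {(6.35**2/n + 6.99**2/n) ** 0.5:.2f}")  # 4.72
```

The two numbers should come out nearly identical, which is the whole point of the standard error formula: it lets you estimate the sampling distribution’s spread without running zillions of usability tests.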

Now that we know the shape and size of the sampling distribution, you can calculate the probability of seeing a -0.25 point difference given a hypothetical true difference of 1.00. Actually, the probability of getting exactly -0.25, or 0.00, or any single exact number is pretty small because only a small fraction of all the possible samples have exactly that difference. Even the chance of getting a sample difference of exactly 1.00, precisely the hypothetical population difference, is relatively small (though larger than any other probability), given you’d expect some deviation due to sampling error. Getting 1.00 in your sample would be like getting exactly 500 heads after flipping a coin 1000 times. It would be a big coincidence, in a weird way.

What we actually want to do is calculate the probability of a *range* of values that starts at the observed difference and *includes all values beyond it that would lead to the same design decision*. This will include either all values greater than your observed difference or all values less than your observed difference. For the case of Hypothetical State A, if the observed difference is sufficiently larger than 1.00 (the “who cares?” threshold), then you decide to redesign the site. Since we’re looking for *larger* values, we want the range *greater than* or equal to our observed difference. We will calculate the probability of getting -0.25 or more given a hypothetical true difference of 1.00. This will include all differences from -0.25 through the right end or “tail” of the sampling distribution.

For Hypothetical State B, if the observed difference is sufficiently smaller than 3.00, then you decide to not redesign the site. Since we’re looking for *smaller* values to make a design decision, we want the range *less than* or equal to the observed difference. We will calculate the probability of getting -0.25 or less given the hypothetical true difference of 3.00. This will include all differences from -0.25 to the left tail of the sampling distribution.

That’s the pattern of greater-than and less-than that you get when the larger the values are “better” (i.e., lead you to favor the new design over the old). If you were using a different measure of product performance where the smaller the value the better (e.g., task completion time), it would be backwards: For Hypothetical State A, you’d want the probability of getting less than or equal to your observed difference, and for Hypothetical State B, you’d want the probability of getting greater than or equal to your observed value.

Here’re the steps for a t-test, applied to Hypothetical State A, the probability of getting a -0.25 point difference or more given the true difference is 1.00.

**Step 1. Calculate your standard error** from the sample standard deviations. Oh, right. We already did that. It’s 4.72.

Right away, you can predict this t-test is not going to give you a small p-value that would lead to a design decision. You’ve an observed difference of -0.25, which is only 1.25 points from Hypothetical State A: that’s less than the standard error of 4.72. Your observed difference is well within range of how much an observed difference *typically* deviates from the hypothetical value from sample to sample. It has to be pretty plausible that you’d observe -0.25 given a hypothetical true difference of 1.00. But let’s forge on.

**Step 2. Calculate your observed t-statistic**, which represents the deviation of your observation from the hypothetical state in units of the standard error.

*tstat* = (*diff* – *hypo*) / *se*

The numerator of the t-statistic formula captures the deviation between the observed difference (*diff*) and the hypothetical difference (*hypo*). Dividing by the standard error converts your units from the units of the scale (5 to 35 points) to universal units of number of standard errors. This way, we don’t need to custom-make a sampling distribution for every variable you measure and test. We can compare your difference in “t-units” to a single standard t distribution. A t distribution is essentially a normal distribution with an average of 0 and a standard deviation of 1, but with a small adjustment to its shape to account for the fact that we only have an estimate of the standard deviation based on a sample of a particular size. Since our sampling distribution is (close enough to) normal and we estimated the standard deviation (the standard error) from a sample of eight users, it’s just the thing we need. Bow down once again to the ancient statistical priests.

So with our data,

*tstat* = (-0.25 – 1.00) / 4.72 = -0.26

**Step 3. Get the observed p-value for your t-statistic.** If you had stat back when all surfboards qualified as guns, you’d have used printed tables for this. Today, we can use a spreadsheet function. In Excel, it’s TDIST(), which takes as parameters your t-statistic, your degrees of freedom, and the number of “tails.” The degrees of freedom are used to adjust the normal distribution to account for using an estimated standard error. The degrees of freedom for two separate groups of users is:

*df* = *n1* + *n2* – 2

Or 4 + 4 – 2 = 6 in our case.

For now, put in one for the number of tails. We’ll get to two-tailed tests in Stat 202.

Now if you’ve done exactly like I said with Excel, it gives you an error rather than a p-value. Did you do something wrong? No, remember MS’s unofficial slogan.

For reasons I cannot fathom, Excel refuses to accept negative t-values, which is very strange since half of all t-values in the t-distribution are negative. It’s like a cashier that only accepts one, ten, and fifty dollar bills. Furthermore, the TDIST() function only gives the p-values for the entered t-value or larger. There’s no flag to pass to tell it you want the p-value for a given t-statistic or smaller.

But it’s not a bug, it’s a feature, because it forces you to draw a diagram of the sampling distribution in order to figure yourself out of this problem. When you do, you might see something surprising.

Okay, draw your sampling distribution. It’s a normalish curve with a standard deviation equal to your standard error (4.72, in this case). The midpoint is equal to your hypothetical difference (1.00 for Hypothetical State A). Mark your observed difference on it (-0.25). Below each number, mark the equivalent in t-units: 0 for the midpoint, 1 for the standard error, and -0.26 for the observed t.

Now shade the area that you’re calculating the probability for. For Hypothetical State A that’s everything at or greater than -0.25, all the way off the right tail of the distribution.

Whoa, is that right? More than half of the curve is shaded. The total probability for the entire curve is 1.00. That is, you have a probability of 1.00 (total certainty) of getting some difference that’s in the sampling distribution. That makes sense: the sampling distribution by definition includes every possible difference from every possible sample. But if most of the sampling distribution is shaded, then the probability of getting a difference of -0.25 points or more given a population difference of 1.00 point is more than 0.50. Getting a difference of -0.25 or more is not only plausible, it’s *likely* given a real difference of 1.00. There is no way you’re going to get a p-value less than 0.10, or less than any reasonable chance anyone would want to have of making a Type I error. It’s not just an uphill battle. It’s an unscalable wall. This is going to be true whenever your range of values includes the hypothetical state.

**Step 3-and-a-half. Work around the limitations of your spreadsheet functions.** So, to be fair to Microsoft, you don’t need to use TDIST() in this case. But let’s undo the problem Excel caused and get the exact p-value anyway so you know how to do it. To do this, we rely on the fact that the t-distribution is symmetrical. The p-value for -0.26 or greater is equal to the p-value for 0.26 or less. So the solution is to calculate the p-value for 0.26 or less.

Problem: TDIST() doesn’t give the p-values for a t-or-less. Fine. We know the entire distribution totals to 1, so we get the p-value for 0.26 or more and then subtract it from 1.

So the p-value for -0.26 or more is equal to 1 – TDIST(ABS(-0.26),6,1). Sheesh. There really isn’t any good way to work this out other than sketching the sampling distribution and figuring it through.

TDIST() for a *tstat* of 0.26, *df* of 6 and 1 tail gives a p-value of 0.400. So the p-value for Hypothetical State A is 1 – 0.400 = 0.600. Just as we suspected, we’re likely to get a difference of -0.25 or more in the sample when the real difference is 1.00.
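If wrestling with TDIST() leaves you seasick, the same number can be computed in a language that doesn’t mind negative t-values. Here’s a sketch in Python; to stay standard-library-only it gets the upper-tail probability by numerically integrating the t density, but if you have scipy, `scipy.stats.t.sf(tstat, df)` does the same thing in one line:

```python
from math import gamma, sqrt, pi

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_sf(t, df, steps=20_000, far=60.0):
    """P(T >= t): one-tailed p-value, by trapezoid integration of the tail."""
    if t < 0:
        return 1 - t_sf(-t, df, steps, far)  # symmetry handles negative t
    h = (far - t) / steps
    area = 0.5 * (t_pdf(t, df) + t_pdf(far, df))
    area += sum(t_pdf(t + i * h, df) for i in range(1, steps))
    return area * h

se = 4.72                     # standard error from the sample
tstat = (-0.25 - 1.00) / se   # observed t for Hypothetical State A, about -0.26
p_a = t_sf(tstat, df=6)       # P(T >= -0.26), negative t and all
print(f"{p_a:.3f}")           # 0.600
```

No sketching, no subtracting from 1 by hand: the symmetry trick is buried in `t_sf` where you can’t mess it up.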

Okay, so we’ve nothing to statistically compel us to redesign the site. But does that mean we should stick with the old design? Not by itself. It’s time to calculate the probability of getting -0.25 points if Hypothetical State B were true about the population. Now we have the complementary logic: If our observed difference (or less) is implausible assuming Hypothetical State B is true, then we stay with the old design. That is, if there is less than a 0.10 probability of getting no more than -0.25 points for the Definitely Redesign state, then we conclude that the true difference in the population is lower than 3.00 points: it’s at least in the Whatever Zone (where staying with the old design isn’t an appreciably harmful choice to Stan’s business), and may be at or below the Definitely Don’t Redesign level.

For Hypothetical State B, Step 1 is already done -the standard error isn’t affected by the hypothetical states. Step 2 is:

*tstat* = (-0.25 – 3.00) / 4.72 = -0.69


For Step 3, sketch the sampling distribution.

We’ve another negative t, so we have to use the equivalent point on the other side of the t-distribution to use TDIST().

You see in this case we are getting the p-value for the equivalent t *or more*, so we *don’t* have to subtract the result from 1. The p-value is just TDIST(0.69,6,1), or 0.259.

Hmm. It seems it’s *also* pretty plausible you’d get -0.25 points or less when the true population difference is 3.00. The p-value is higher than 0.10, so now you’ve nothing to compel you to keep the old design. It’s a no-win situation. If you proceed with the new design, you have a 60% chance of making a Type I error and redesigning when you definitely shouldn’t. If you keep the old design then you have almost a 26% chance of making a Type II error and not redesigning when you should. Basically you can’t make a reasonably safe design decision one way or the other. Dude, both of your options suck. You’re bobbing in statistical mush going nowhere.

In this particular case, you may want to offer to Stan that you work to improve the new design for no additional cost to him. Given your observed difference in the averages is within a standard error of Hypothetical State A, you know it’s quite plausible this -0.25 you’re seeing is sampling error. However, the fact that the users rated the new design as worse than the old means it’s more likely the new design really is worse than the old than vice versa. It may not be *much* more likely, but it’s still somewhat more likely. Maybe you should make the design better before subjecting it to more summative testing.

On the other hand, if the new design already represents your best effort, and there are no hints from the usability test on improving the new design, maybe your business is better off testing more users now than wasting time blindly trying to improve what you have. If the new data forces you to conclude the new design is worse than the old, *then* you can offer to try to improve the design some more. Or you can walk away. Maybe this just isn’t the project for you.

This course of action only applies when the new design appears to be doing much worse than intended, like here where it’s doing worse than the old design. If the new design were doing, say, 3 points *better* on average than the old, you’d still be getting pretty big p-values given the hypothetical states (go ahead and calculate them assuming the same standard error), but it would be encouraging: the sample’s performance would be right on the Definitely Worth It threshold. It would seem you only need a bigger sample size to be convinced it’s real.

That’s the other course of action. You could collect more data. We saw in Stat 101 how larger sample sizes mean smaller Type I and Type II error rates for a given level of user performance. That’s the whole reason why you prefer large sample sizes when you can afford them. Larger sample sizes increase what I call the *decision potential* of your usability test -your ability to make a decision at acceptable Type I and Type II error rates. Statisticians loosely talk about larger sample sizes increasing statistical power (that is, one minus the Type II error rate), but that’s only because in scientific work the Type I error rate is traditionally fixed at 0.05, so increasing the sample size will only change the Type II error rate. However, emphasizing only the impact on power masks the fact that bigger sample sizes also help you make decisions regarding Hypothetical State A as well as Hypothetical State B. For a given Type I error rate (or level of statistical significance), bigger samples mean you can confidently decide to redesign with a sample user performance closer to Hypothetical State A. They also mean you can confidently decide *not* to redesign with a sample performance closer to Hypothetical State B. Bigger samples give you a smaller in-between range of performance where you can’t decide either way.

You can see the role of sample size on decision potential in the t-statistic formula. The t-statistic represents the deviation of your observation from the hypothetical state in units of the standard error. To get a lower p-value for either hypothetical state, you need a larger t-statistic -a greater deviation. Looking at the formula for the t-statistic, you see you can’t do anything to increase the numerator to get a bigger t-statistic, at least, not anything ethical. You can hope that collecting more data will shift the difference in averages to something more favorable for your design, but you can’t make it happen. It’s going to be what it’s going to be (Whoa, says Stan, statistics are *deep*, bro).

But you can do something to shrink the denominator and thus get a bigger t-statistic. The formula for the standard error shows that gathering more data directly reduces it. Make either or both group sizes bigger (bigger *n1* and/or *n2*), and the standard error gets smaller. That’s just another way of saying what you already know intuitively: the bigger the sample, the more accurate the statistics that come from it -the less your observed difference in the averages will tend to deviate from the real one. Since the t-tests for Hypothetical States A and B both use the sample standard error, a smaller standard error increases the t-statistic for both. Bigger t-statistics mean greater decision potential.

To summarize: bigger sample size means smaller standard error means bigger t-statistics mean smaller p-values mean you can make a design decision at acceptable Type I and Type II error rates. Not only do you know that increasing the sample size will get you to a design decision, but, because you have standard deviations for each group, you can estimate *how much bigger* your sample needs to be. Surfer Stan had told you that he prefers his Type I and Type II error rates be kept around 0.10. Play with Excel’s TDIST() function a little, and you’ll find you need a t-statistic of about 1.3 to get a p-value of 0.10 with larger sample sizes (and therefore more degrees of freedom). A little algebra tells us:

*tstat* = (*diff* – *hypo*) / *se*

*needed se* = ABS(*diff* – *hypo*) / *tstat*

In the case of Surfer Stan, and Hypothetical State B:

*needed se* = ABS(-0.25 – 3.00) / 1.3 = 2.50

That is, assuming the difference in the averages remains the same at -0.25, you’ll be able to conclude the new design is not considerably better than the old if you can get the standard error down from 4.72 to 2.50 -you need to cut the standard error almost in half. Normally, you can use the same formula for Hypothetical State A to see how much smaller the standard error has to be to conclude the new design is better than the “who cares?” level of performance, but in this case, with the new design performing worse than the old, there’s no such number -when your shaded region of the sampling distribution passes through the hypothetical value, your p-values will always be greater than 0.500.

Now let’s figure out how many more users you need to run to reduce the standard error to 2.50. Assuming you have equal numbers of users trying each design (and it’s generally a good idea to try to accomplish that) then the standard error is inversely proportional to the square root of your sample size. Algebraically:

*se* = SQRT(*s1*^2/*n1* + *s2*^2/*n2*)

Given *n1* = *n2*, we’ll just call the sample size of each group *ng*, so *ng* = *n1* = *n2*:

*se* = SQRT(*s1*^2/*ng* + *s2*^2/*ng*)

*se* = SQRT( (*s1*^2 + *s2*^2) / *ng* )

*se* = SQRT(*s1*^2 + *s2*^2) / SQRT(*ng*)

If the size of the standard error is inversely related to the square root of your sample size, then the ratio of your current standard error over your needed standard error is equal to the ratio of the square root of your needed sample size over the square root of your current sample size. Mathematically:

*se current* / *se needed* = SQRT(*ng needed*) / SQRT(*ng current*)

More algebra:

*ng needed* = (*se current* / *se needed*)^2 * *ng current*

You currently have 4 users in each group (*ng* = 4). Given a standard error of 4.72, and a needed standard error of 2.50:

*ng needed* = (4.72 / 2.50)^2 * 4 = 14.26

Conclusions:

- You need an estimated 14 users (rounded) per group, which means,
- Since you have 4 users so far, you need 14 – 4 = 10 more users per group, which means,
- You need to find a total of 2 * 10 = 20 more users, which means,
- You’ve a lot more usability testing to do, which means,
- You can forget about spending two days of this business trip lying on the beach, conducting “ethnographic research.”
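The sample-size arithmetic above is easy to script. Here’s a minimal Python sketch (the variable names are mine; the 1.3 target t-statistic is the value derived from TDIST() for Stan’s 0.10 error rates):

```python
# Estimate how many users per group are needed to shrink the standard
# error enough to decide about Hypothetical State B.
diff, hypo_b = -0.25, 3.00        # observed difference; Hypothetical State B
target_t = 1.3                    # t-statistic giving p of about 0.10 at larger df
se_current, ng_current = 4.72, 4  # current standard error and per-group size

se_needed = abs(diff - hypo_b) / target_t               # 2.50
ng_needed = (se_current / se_needed) ** 2 * ng_current  # about 14.3 per group

more_per_group = round(ng_needed) - ng_current  # 10 more users per group
total_more = 2 * more_per_group                 # 20 more users in all
```

Because the standard error shrinks with the square root of the sample size, cutting it nearly in half requires roughly quadrupling each group, which is exactly what going from 4 to 14 users per group reflects.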

Going from a sample of eight to 28 seems like a lot more work than should be necessary (especially just to convince yourself the new design won’t be substantially better than the old). However, remember the standard error varies with the inverse of the *square root* of the sample size. If you need to cut the standard error in half, you need to *quadruple* your sample size. Even then, running 20 more users does not guarantee you’ll be able to make a go/no-go decision. It’s an estimate assuming the difference between the averages remains at -0.25 and the standard deviations remain the same. They probably won’t, of course, because there’s sampling error. But look at the bright side: running a nice big sample also gives you the best chance of finding out that the new design is actually better than the old, if it really is better. Text me when you’re done. I’ll be hitting the waves.

A day or two later, here’s data from 20 more users appended onto the original data, along with the revised statistics:

**Data**

| Old Home Page User | Rating | New Home Page User | Rating |
|---|---|---|---|
| 1 | 23 | 2 | 26 |
| 3 | 20 | 4 | 29 |
| 5 | 31 | 6 | 21 |
| 7 | 16 | 8 | 13 |
| 9 | 28 | 10 | 26 |
| 11 | 20 | 12 | 21 |
| 13 | 25 | 14 | 31 |
| 15 | 17 | 16 | 29 |
| 17 | 18 | 18 | 27 |
| 19 | 21 | 20 | 26 |
| 21 | 17 | 22 | 29 |
| 23 | 16 | 24 | 33 |
| 25 | 35 | 26 | 23 |
| 27 | 15 | 28 | 29 |

**Statistics**

| | Old Home Page | New Home Page | Both Pages |
|---|---|---|---|
| Sample | 14 | 14 | 28 |
| Average | 21.57 | 25.93 | 23.75 |
| Std Dev | 6.14 | 5.11 | 5.97 |
| Skew | 1.05 | -1.23 | -0.02 |
| Difference in averages | | | 4.36 |

Now we’ve got a different story to tell: The new page scored 4.36 points better than the old. Let’s run a t-test on Hypothetical State A.

Step 1: Standard Error

*se* = SQRT(*s1*^2/*n1* + *s2*^2/*n2*)


*se* = SQRT( 6.14^2 / 14 + 5.11^2 / 14 ) = 2.13

Cool. Due to small changes in the standard deviations, our standard error is a little lower than we hoped to get by adding 20 more users.

Step 2: t-statistic

*tstat* = (*diff* – *hypo*) / *se*

*tstat* = (4.36 – 1.00) / 2.13 = 1.57

With a bigger difference in the averages and a smaller standard error, it’s no surprise the t-statistic is substantially larger.

Step 3 through 3-and-a-half: p-value

We sketch our sampling distribution. We’ve a positive t-statistic and we want to know the probability of getting a difference of 4.36 or greater, so the shaded region is the upper tail of the sampling distribution, to the right of our observed difference.

For once, we don’t have to go through any contortions to use TDIST(). Degrees of freedom are now:

*df* = *n1* +* n2* – 2

*df* = 14 + 14 – 2 = 26

And so:

*p* = TDIST(1.57, 26, 1) = 0.064

The p-value is 0.064. It’s pretty implausible that the new design is insufficiently better than the old -maybe not the most implausible thing you’ve ever faced, but implausible enough that Stan believes giving you the contract for the entire new site is a good business decision with tolerable risk. It seems the initial result from the sample of eight users was just sampling error. Lucky.
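If you’re not working in Excel, you can reproduce the whole Step 1-through-3 calculation, including TDIST()’s one-tailed p-value, by numerically integrating the t-distribution’s density. A minimal Python sketch using only the standard library (the integration shortcut and names are mine, not the post’s):

```python
import math

def upper_tail_p(t, df, hi=60.0, steps=50_000):
    """Approximate P(T >= t) for Student's t, like Excel's one-tailed TDIST()."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):  # density of the t-distribution with df degrees of freedom
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    # Trapezoid rule from t out to a far-off bound where the tail is negligible.
    h = (hi - t) / steps
    total = 0.5 * (pdf(t) + pdf(hi))
    for i in range(1, steps):
        total += pdf(t + i * h)
    return total * h

# Statistics from the 28-user sample
s1, s2, n1, n2 = 6.14, 5.11, 14, 14
diff, hypo = 4.36, 1.00                  # observed difference; Hypothetical State A

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # Step 1: standard error, about 2.13
tstat = (diff - hypo) / se               # Step 2: t-statistic, about 1.57
p = upper_tail_p(tstat, n1 + n2 - 2)     # Step 3: p-value, about 0.064
```

This is a rough numerical stand-in for TDIST(), good to several decimal places for the sample sizes in this post, not a production statistics routine.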

You can run the t-test for Hypothetical State B if you want, but you should already be able to tell it’ll have a pretty big p-value since the observed difference is within one standard error (the new smaller standard error) of the hypothetical value. In fact, since the observed difference (4.36) is greater than the hypothetical state (3.00), and we’re testing for the probability of getting the observed difference or smaller, you know the p-value will be over 0.500. It’s a mirror image of the situation we had with the initial sample and Hypothetical State A. However, at this stage it doesn’t matter what the p-value is. If the p-value were very low, then you’d be pretty sure there isn’t a *definite* advantage with the new design, but you’d still be pretty sure it is at least better than the “who cares?” level of performance. You’d conclude you’re in the Whatever Zone, to which Stan would likely say, “whatever,” and give you the contract, especially after already sinking some money into the test home page.

So congratulations. On to more designing, more usability testing, and more statistics. Whoo!

**Solution**: Follow the flow chart below. Consider it provisional, because there are other issues to address that we’ll cover in Stat 202.

- With your client, set your Hypothetical States A and B.
- Conduct a t-test for Hypothetical State A
- Calculate your standard error.
- Calculate the t-statistic.
- Determine the p-value with something like Excel’s TDIST() function.

- If the p-value is sufficiently low to represent a tolerable chance of a Type I error rate, proceed with the new design.
- If the p-value represents an excessive Type I error rate, conduct a t-test for Hypothetical State B.
- If the p-value represents a tolerable chance of a Type II error rate, do not proceed with the new design.
- If the p-value represents an excessive Type II error rate, calculate the increase in sample size you need to get a low p-value for either Hypothetical State A or B.
- Increase your sample size.
- Repeat
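The flow chart can be sketched in code. This Python sketch is mine, not part of the post’s spreadsheet; it approximates Excel’s TDIST() by numerically integrating the t-distribution’s density, and the function names and return strings are assumptions:

```python
import math

def upper_tail_p(t, df, hi=60.0, steps=50_000):
    """Approximate P(T >= t) for Student's t, like Excel's one-tailed TDIST()."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):  # density of the t-distribution with df degrees of freedom
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = (hi - t) / steps
    total = 0.5 * (pdf(t) + pdf(hi))  # trapezoid rule out to a far bound
    for i in range(1, steps):
        total += pdf(t + i * h)
    return total * h

def decide(diff, se, df, hypo_a, hypo_b, alpha=0.10):
    """Walk the flow chart: test State A, then State B, else gather more data."""
    p_a = upper_tail_p((diff - hypo_a) / se, df)   # chance of diff or greater
    if p_a <= alpha:                               # tolerable Type I error rate
        return "proceed with the new design"
    p_b = upper_tail_p(-(diff - hypo_b) / se, df)  # chance of diff or smaller
    if p_b <= alpha:                               # tolerable Type II error rate
        return "keep the old design"
    return "increase your sample size and repeat"

# The initial 8-user test vs. the full 28-user test
print(decide(-0.25, 4.72, 6, 1.00, 3.00))   # neither p is low enough to decide
print(decide(4.36, 2.13, 26, 1.00, 3.00))   # State A is now implausible
```

With the initial eight users the function lands on the “increase your sample size” branch, and with the full 28 users it lands on “proceed,” matching the story above.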

Here’s an Excel sheet with this post’s data and analysis.

The simple formula for degrees of freedom, *df* = *n1* + *n2* – 2, is good enough for most usability testing situations. However, if there are big differences in the standard deviations of your two groups of users, then you need to make a complicated adjustment. As a rule of thumb, you should calculate the adjustment if one standard deviation is at least twice the size of the other. For example, you’d make the adjustment if, after running 20 more users, the new home page had a standard deviation of, say, 12.00 while the old home page remained at 6.14.

How complicated is the adjustment? Well, first let’s define the error variances, *v1* and *v2*, for each group as:

*v1* = *s1*^2 / *n1*

*v2* = *s2*^2 / *n2*

The adjusted degrees of freedom are then:

*df* = (*v1* + *v2*)^2 / (*v1*^2 / ( *n1* – 1) + *v2*^2 / (*n2* – 1))

When using TDIST(), round the result to the nearest integer. TDIST() accepts only integers for the degrees of freedom and will truncate any decimal number, so rounding is more accurate.

When the group standard deviations are about the same, the results are essentially equal to what you get with the simple *df* formula. For example, with the standard deviations of 6.14 and 5.11 that we got with 28 users:

*v1* = 6.14^2 / 14 = 2.69

*v2* = 5.11^2 / 14 = 1.86

*df* = (2.69 + 1.86)^2 / (2.69^2 / ( 14 – 1) + 1.86^2 / (14 – 1)) = 25.17

We round 25.17 to 25, which yields the same p-value (rounded to three places) as we got with df = 26 earlier:

TDIST(1.57, 25, 1) = 0.064

A few degrees of freedom here or there don’t make much difference when the total sample size is about 30 or more.

However, if the standard deviations were very different, such as 6.14 and 12.00:

*v1* = 6.14^2 / 14 = 2.69

*v2* = 12.00^2 / 14 = 10.29

*df* = (2.69 + 10.29)^2 / (2.69^2 / ( 14 – 1) + 10.29^2 / (14 – 1)) = 19.36

Rounding to 19,

TDIST(1.57, 19, 1) = 0.066

Okay, it *still* doesn’t make much difference, but it’s good we played it safe and made the adjustment. The smaller the sample sizes, the bigger the difference. For example, if there were 4 users per group, the p-value goes from 0.083 with the simple formula to 0.095 with the complicated adjustment (the adjustment can only increase p-values).

So, you’re generally safe to use the simple formula for degrees of freedom in usability testing. On the other hand, you can do no wrong using the complicated adjustment. Just to be extra-anal, I’ve updated the spreadsheet with this post’s examples to use the adjustment.
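The adjustment is less painful in code. Here’s a minimal Python sketch of the adjusted degrees of freedom formula above (the function name is my own; this is the standard Welch-Satterthwaite calculation):

```python
def welch_df(s1, s2, n1, n2):
    """Adjusted degrees of freedom from the group standard deviations and sizes."""
    v1, v2 = s1**2 / n1, s2**2 / n2  # error variances for each group
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

similar = welch_df(6.14, 5.11, 14, 14)     # about 25.2, near the simple df of 26
different = welch_df(6.14, 12.00, 14, 14)  # about 19.4, a bigger drop
```

When the standard deviations and group sizes are equal, the formula collapses exactly to the simple *n1* + *n2* – 2.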

Parenthetically, a large difference in your standard deviations may itself be a significant finding, in both senses of the word. For a simple way to get the p-value for the difference of two sample standard deviations, read up on the F-test. A higher standard deviation in one group indicates those users are comparatively polarized. For example, if the new home page had a standard deviation of 12.00, it would suggest that, relative to the old home page, users tended to either love or hate the new home page. That could have design or deployment implications.

If you’re leafing through your intro stat textbook trying to figure out where all this is coming from, here it is: the procedure I’ve outlined in this post is a “t-test for separate variance estimates,” in contrast to the “t-test for a pooled variance estimate” that most textbooks present. The t-test for a pooled variance estimate assumes the two groups have the same population standard deviation and any difference you see in the sample standard deviations is sampling error. You then estimate the population standard deviation with essentially a weighted average of the two sample standard deviations.

However, my philosophy is don’t assume anything you don’t have to assume. Usually, it’s pretty reasonable to assume the two groups have the same population standard deviation, but why take the chance? The t-test with separate variance estimates is always the safe option. It gives the same p-value as the t-test for a pooled variance estimate when your two groups have the same number of users and there’s little difference in the sample standard deviations, both of which are usually true in usability testing. However, it protects you in case any differences in the sample standard deviations are, in fact, reflected in the population. So you lose nothing but gain peace of mind by using the t-test for separate variance estimates. The only drawback is the more complicated adjusted degrees of freedom calculation, which is important to do when sample standard deviations are *very* different and sample sizes are small.

I forgot about adjusting the degrees of freedom. Hey, you can’t believe everything on the web.

Now you’ve done it. You should’ve kept your mouth shut. There you were, comparing user ratings between a couple web home pages, and, to your disappointment, the amateurish proto-CSS design had a better average survey rating than the fully researched state-of-the-art replacement you’ve been promoting. But you’re not about to give up. After all, there were only eight users in the test. Maybe that wasn’t enough. So you put on your most authoritative voice and try, “Of course, we should do an inferential statistical test to see if this is just sampling error.” Whoa, dude, you know statistics? “Well, yeah, I took an intro course in college years ago, and so I know we should-” Cool, brudda. Go run the stats and tell us what it shows.

Uh oh. You remember what statistics is about, but do you remember how to do it? Lucky for you, you ended up on this web site. Here I’ll explain how to perform statistical analyses to answer questions like whether your sample size is large enough to draw valid conclusions. I’ll cover analysis of common measurement types used in usability studies and common usability test designs. That should be doable in a single blog post, no?

No. Sometime last month while writing this, it finally dawned on me that this was too much for one post. So instead I present to you a series of posts on statistics and usability, which, contrary to previous series like Learning from Lusers, I’ll post one after the other each month, like a DJ playing a block of classic Led Zeppelin.

We start with:

Stat 101: Sample size, inferences about the user population, Type I and II errors, statistical significance, and statistical power.

Yes, Stat 101 from May 2010, which retroactively becomes the first in the series. Stat 101 covers the basic principles of statistics for those with no knowledge of stats. It includes tables you can apply to task failure rates observed in your usability tests. It concludes that:

- You *can* do statistical analysis of small sample size usability tests.
- You *should* do statistical analysis of small sample size usability tests.
- Doing so will often show that you have perfectly valid data from your small sample size usability tests.

If you haven’t read Stat 101, go ahead and read it, even if you’re pretty good with statistics. Stat 101 applies your typical academic statistics to the problem of user interface design, and thus there are some unconventional points, processes, and perspectives that aren’t covered in a typical college course that we’ll use in the later posts. And tell Wayne in Finance I said “hi.”

After Stat 101, we’ll go to the 200-series.

Stat 201: Comparison of averages, analysis of user ratings, improving on an existing UI design, estimating needed sample sizes, the normality assumption, between-subject testing, and one-tailed t-tests.

Stat 202: More comparisons of averages, analysis of completion times, choosing between two new UI designs, maximizing statistical power or “decision potential,” within-subject testing, two-tailed t-tests, and data transformations.

Stat 203: Analysis of frequency data, such as task completion rates and conversions, binomial tests, chi-square tests, and more.

And we’ll work with Surfer Stan.

The 200 series applies the concepts I covered in Stat 101, but moves beyond the simple case of how to interpret *x* number of users failing at task *y* in a usability test. Now we’re going to start looking at “relationship inferences,” which will help us decide which design alternative is better. At the same time, while Stat 101 is a prerequisite, I’m not picking up where Stat 101 left off (that’s why it doesn’t start with Stat 102). We’re going to get into more sophisticated analyses, and I’m afraid I can’t compress an entire stat textbook into a few posts. The 200 series assumes you’ve had some stat, and the formulas I present are review -maybe a review from when there was such a thing as “record stores,” but a review anyway. If you are a usability engineer, and you’ve never had intro stats, what are you waiting for? Go take it now. Or at least read a book on it. Nothing like bending your brain around sampling distributions to while away a Sunday afternoon.

How sophisticated will the analyses be? Well, you’re going to have to do some actual calculations this time, but nothing you can’t do with the built-in formulas of a spreadsheet. No need to rush out and plunk down the hefty licensing fee for SAS. We aren’t going to do any multivariate principal component maximum likelihood structural hazard modeling whatevers. Instead, I’m talking about the t-test, the chi-square test, and binomial tests (the latter actually introduced in Stat 101). The series is geared to people who occasionally need to compare human performance on a couple alternative designs. In particular, the 200 series is for UX practitioners who want to know which of two designs is better for the users, which is the most common statistical situation in usability engineering. I’m also assuming you merely want to make reasonably well-informed design decisions, something better than, “oh, I guess that sample size is big enough,” and you don’t need high-precision academic-grade absolutely-the-best statistical tests. If you want the latter, then hire or contract a professional statistician with his or her own SAS license. All statistics is approximation. It’s just that some approximations are better than others. I’m aiming to go over some stats that are good enough for most needs in usability engineering.

When I was a usability engineer for a telecommunications development firm, there was a big push for ISO 9001 certification, which was intended to indicate that we produced quality products. I wanted to produce quality products, but the truth was I didn’t know how to make quality UIs. In retrospect, I see that many of my designs back then were inferior, going only half-way towards elegance. Windows were simplified when they could’ve been eliminated entirely. Layouts were functional but cluttered and wasteful. Disabling was not used consistently. There was no vision. I wanted a process that would help me achieve quality.

Unfortunately, the definition of “quality” in ISO 9001 is different from an ordinary person’s definition of quality. “Quality” means “no defects,” which means “performs according to spec.” The trick, as I saw when we developed our quality process for ISO 9001, is that the spec or requirements can be anything you want, and can even be altered retroactively to match actual product performance. So ISO 9001 certification really meant having a process for painstakingly documenting precisely how low our quality was. If your binders are thick enough, and updated enough, you have “quality” no matter how superior or inferior the product you’re shipping.

One less cynical thing I acquired from my time in telecommunications is a high respect for the software testers. Part of the respect was because it seemed like such an incredibly monotonous but important job. The other part of the respect was from how they made me a better designer. Prior to true usability testing, functional software testing is actually a good way to find UI issues and requirements we didn’t know we needed. Testers were the closest I could get to usability test participants (the development team explicitly forbade me from doing any actual user research, apparently afraid I might lead users to believe they might be getting a quality product). Testers would notice usability issues from *actually using* the app, not just running scripts against it. From this, I began to learn what makes true quality.

Any complicated project needs a system for documenting its goals and progress. There are too many details to remember and too many individuals to coordinate to avoid writing things down and tracking them. However, documentation is not equivalent to quality. Does an expert furniture maker follow a blueprint? Does a top chef follow a recipe? Being schooled in industrial engineering, I’m familiar with the widely accepted definitions of quality as “conformance to requirements” or “absence of defects,” but I’ve always felt these are missing something. Industrial engineering is specifically concerned with the design and analysis of product production. Thus, its definition of quality is limited to the quality of the production -the degree the finished product varies from design. It has limited relevance to software engineering where actual production (i.e., software duplication) has no variability at all, and the line between designing the software and developing it is blurred, or, in the case of agile development, non-existent. Production quality is only part of overall quality. Production is only one step of the product lifecycle. It doesn’t matter how little the final product varies from design if the design is crappy to begin with. Particularly with software and websites, we need to consider the *design* quality when assessing overall product quality.

So what exactly is overall quality?

My Aerostich Roadcrafter motorcycle riding suit is an overall high quality product. That’s me wearing it in the obligatory photo of myself on the obligatory About page. The Roadcrafter has excellent functional performance, protecting me from wind and rain. It’s durable, long-lasting, and trustworthy, with a stellar reputation of literally saving your skin in the event of an involuntary transition from the sport of motorcycle riding to the sport of asphalt body surfing. It’s easy to maintain, being machine washable and dryable. It’s comfortable -Gore-Tex makes it breathable while waterproof, and the foam in the crash armor is soft at the slow velocities characterizing normal joint movements, but rigid at the high velocities that would come from impact in a crash. It’s convenient, with many pockets to carry various riding paraphernalia (e.g., gloves, sunglasses, keys), and with a clever “upside-down” zipper design for easy donning. It’s versatile, suitable for fair and foul weather from over 100 F to sub-freezing. It’s tasteful and attractive. It’s made with superior design, top-notch materials, and excellent production standards to achieve its level of performance. These are the things that make any product a high quality product.

When you think of the high quality products in your life, you realize they are the things that provide you the greatest net gain. Whether you’re talking about a riding suit for consumers to wear, or a ball bearing in an industrial machine somewhere, a high quality product is one that delivers the greatest benefits for the least compromises for whomever is using it. That is, with the exclusion of one significant attribute of the product: the monetary cost. It’s understood that high quality often means shelling out more lucre. Some consumers even assume that more expensive products necessarily have higher quality. Certain software vendors and consultants have succeeded in increasing demand for their products and services by *increasing* their prices, contrary to what Microeconomics 101 would teach you. But let’s put the relation of monetary cost and quality aside for now.

Including everything else but price in the equation, the higher the cost-benefit, the higher the quality. A high-quality web site, for example, would have excellent layout and information architecture, readable font, elegant pleasing aesthetics, and correct and useful information, all things that benefit the user. Littering your web site with ads reduces the quality of the web site, because ads are a cost in the cost-benefit equation. Web site ads generally are distracting, ugly, and not useful to the user. More to the point, users don’t like them; they’ll actively attempt to avoid them. PBS is touted as “quality television” (I think that used to be their slogan) in part because they don’t interrupt their programs with ads, unlike commercial television.

The equation for overall product quality is quite simple:

**Overall Product Quality = User Experience Quality**

High quality is synonymous with an excellent total user experience. When we talk about improving the quality of a product, we’re really talking about improving the quality of the user experience. This rolls both design and production quality into the definition, because the quality of both affect the user experience. It also rolls in quality of availability, purchasing, service, support, maintenance, upgrading, and retirement, because these also impact the user experience. As a UX professional, you’re in the quality business. Achieving quality really means maximizing the dimensions of a positive user experience, which are:

- Usability
- Usefulness
- Desirability
- Reliability

You might recognize that the first three elements are among Peter Morville’s facets of user experience. You may wonder why I didn’t go on and list Morville’s remaining four facets of UX. It’s because they don’t belong. *Findable* and *credible* are characteristics specific to static web sites where the user goal is to retrieve information. I suspect that being an information architect gave Morville a limited perspective of the scope of UX and the products it applies to. I’m going to absorb credible into *reliable*, as I’ll discuss below. Findable is already absorbed; it’s simply the chief form of usability that static web sites need -make it easy to find the information. *Accessible* is also really a component of usability, just usability for a certain select group of users. At least, that’s the way it should be in a quality mass-market product -accessibility shouldn’t just be ticking off a W3C checklist.

As for *valuable*, that facet isn’t a benefit to the user, but rather represents the faith that the other facets will deliver value to the product maker. Which is not always true. Valuable, along with Morville’s concept of credible, shows that his facets are really a list of Things UX Professionals Are Concerned About, which is not quite the same as things that make a positive user experience.

High-quality products are usable. They have high ease of use, making low demands on the users’ time and physical and mental effort. They don’t hassle the user. They allow appropriately trained and experienced users to maximize their performance. For example, a high quality sports car provides tight, precise steering and easy-to-modulate brakes with superior feedback, allowing the skilled driving enthusiast to take curves quickly and confidently.

Usability implies a good fit with the user. High quality products also fit well with the user, being comfortable, clear, and convenient. High quality products are comfortable, fitting physically by having good ergonomics. Just as a high quality suit fits the wearer well without uncomfortable pinches or bunches of fabric, a high quality mobile electronic device also fits the user, being comfortable to hold and operate. High quality products are also clear, fitting mentally with the owner. For example, a good quality textbook will be clearly and precisely written, provide excellent illustrations, and be well-organized pedagogically, building on knowledge the readers already know. It will have a complete index with terms consistent with the users’ mental concepts, as they are both before and after reading the book. Likewise, a high quality web site will have well-organized content carefully prepared in a language familiar to users. Finally, high-quality products are convenient, fitting the owners’ lifestyle. For example, a high quality refrigerator will have features like ice and water available from the door to make one of the most common household tasks the easiest to complete. Likewise, high quality software includes features to make the most common user tasks the easiest to do.

Good quality products are *effortless*. For example, it’s easier to play well on a high quality guitar, with light strings and low action, than on a cheap, clunky-feeling knock-off. Likewise, your high quality websites should be effortless, allowing the user to flow through them. In your usability testing, you want more than just a high proportion of your users arriving at the right content. It’s not enough that users eventually complete the check-out pages and buy something. You want them to get there *quickly*, with minimum pauses, typing, and clicks. Clear and consistent design minimizes delays associated with confusion or error. The rest is about designing in speed. A clunky-feeling website has inadequate servers and bloated content that takes longer than a few seconds to download. A high quality web site delivers content at desktop speeds -half a second or less.

Clunky websites or apps have tedious, unnecessary steps, forcing users to jump through hoops to get their content. High quality site or app design minimizes excise, seeking to support a task with the fewest number of clicks. You can, however, become over-concerned with click-counting. For example, throwing everything on a single web page would certainly minimize navigation, but can result in so much cluttered and confusing content that users spend more time and effort searching for their target information than they would if they had to click a couple of times. Long, unorganized, unbroken, unlabeled, un-scannable content can waste just as much time and effort as clicking. Likewise for verbose, vague, and task-irrelevant text that takes a long time to read, whether it’s copy on a web page, captions for input controls, or error messages.

Nonetheless, when you’re at the design stage, prior to usability testing with a stopwatch, click-counting remains a pretty good way to estimate time and effort to complete a task, and *unnecessary* clicks are a clear sign of low quality. For example, when setting up my automatic backup with InSync, the Destination dialog opened each time at the root of the destination drive, forcing me to click through the directory hierarchy to select each destination folder for my documents, bookmarks, and email. Of course the paths for the documents, bookmarks, etc. are very similar, so this involves a lot of repetitive clicking. Some developer got lazy and didn’t bother to preserve the selection from the last use of the Destination dialog.

One project manager I know likes to think we’re all born with a fixed number of clicks to last our lives, which should not be squandered. In a sense, he’s right: on a familiar web site, a click takes about two seconds, including time to move the mouse, but not including page load time. We only live for so many seconds. Personally, I’d rather not spend them clicking. If you run into a developer who insists that one little unnecessary click is just not that big of a deal, show them the following code (if you’ll pardon my untested pseudocode):

```
Integer sum10s(Integer max) {
    // Sum all values divisible by 10 that you get
    // when counting up to max.
    // Return negative value for negative max.
    Integer sum = 0, ispositive = 1;
    Character str[12];
    IntegerToString(max, &str);
    if (str[0] == '-') {
        str[0] = ' ';
        max = StringToInteger(str);
        ispositive = 2;
    }
    for (Integer n = 0; n < max; n++) {
        IntegerToString(n, &str);
        if (StringToInteger(str[strlen(str)-1]) == 0)
            sum += n;
    }
    if (ispositive == 2) sum = 0 - sum;
    return(sum);
}
```

Your developer should say, “WTF?” and promptly replace it with something like:

```
inline Integer sum10s(Integer max) {
    // Sum all values divisible by 10 that you get
    // when counting up to max.
    // Return negative value for negative max.
    return ((max/10)^2 + max/10) * 5;
}
```

Other than the original code being an offense to all that is holy, why should anyone care about it? Well, it’s slower. It’s harder to maintain. It’s buggy (most egregiously, it returns the wrong value when max is exactly divisible by 10). It’s obviously poor quality code. But practically, what difference does it make? How many microseconds will the second version save, even for large values of max? How much more work is the original to maintain, and will it ever need maintaining? What are the odds and impact of any bug (e.g., is max ever exactly divisible by 10)? Is the original code worth rewriting?

Probably.

The developer’s reaction to the original code is the same as my reaction to an interaction design with an unnecessary click. What’s one little extra click?

- The task takes a couple more seconds to do.
- It’s one more thing to support the users on (training, documentation, help, and tech support).
- It’s one more chance for the user to mistakenly click the wrong thing.

In other words, it makes the user’s process slower, it’s harder to maintain (your users), and it’s interaction-buggy. The two seconds associated with an unnecessary click dwarfs any processing-time savings you’d get by fixing your basic Code of Abomination (your abominations may vary). For most applications, refactoring some code is less important for the user experience than saving one click, better organizing and labeling the content, or editing and tightening the text.

High quality products are maintainable: the maintenance tasks are usable just like the operational tasks are. A high-quality lawnmower is not only easy to start but easy to prepare for the season; for example, the owner can replace the air filter without any tools. Service for a high-quality car includes the convenience of providing a loaner car to the owner.

If you’ve read other stuff on this site, you’re probably not surprised I gave usability top billing on the dimensions of user experience, since that is what I tend to emphasize the most. However, that’s merely a reflection of my expertise. As it happens, I don’t think that usability is the most important dimension for UX, or, by extension, quality. Usability only reduces cost (the effort of using the product); it doesn’t represent any benefit.

*Usefulness* is all about benefit. Usefulness is the functional benefit a product provides the user, the degree the product fulfills its operational purpose. It’s how well a jacket keeps the owner warm and comfortable, and how good the pictures are that a camera takes.

For a car, it’s how quickly and safely it goes, stops, and turns. I once had a sociology professor named Mike Johnson who was quite vocal in his contempt for status symbols, products acquired not so much for their functional benefits to the owners as because they are believed to impress others: friends, family, and business associates. Mike Johnson was not impressed with people out to impress other people. Decades ago, before you could find anti-lock brakes on cars, Mike visited a friend in California who had just bought a Mercedes. Mike didn’t say anything, but his friend knew how Mike felt about such things. So, the friend is driving Mike down an empty highway at night, and the friend says, “Hey, Mike, watch this.” He takes both hands off the steering wheel and slams on the brakes, locking all four wheels. SCREEEEEEEEEEEEEEECH!!! The Mercedes skids to a stop in a perfectly straight line. The friend turns to Mike and says, “*That* is why I got a Mercedes.”

I can only assume Mike replied with, “OK! OK! I BELIEVE YOU! I BELIEVE YOU!”

There is a difference between usefulness and usability. The usefulness of a web site is the usefulness of its content, regardless of how easy that content is to access (the usability dimension). The usefulness of an application is the degree it takes input and processes it into a correct and useful product for the users, regardless of how easy it is to provide the input, set the processing parameters, or read the output.

We can see the difference, and the importance of usefulness relative to usability, in my experience with Garmin’s web site when I attempted to check on the status of a repair order I had with them. Garmin makes aircraft navigation equipment, among other things. You would expect they would have a high appreciation of the importance of quality. However:

- After logging in to MyGarmin and clicking on “Repairs and Exchanges” I was confronted with the message, “You currently do not have any devices that have been exchanged for repair,” which sounds like they never received my returned device, but what they really mean is they *don’t currently have* my device. At the very least, the message should’ve been “*We* currently do not have any devices that you have sent in for repair” (emphasis added for illustrative purposes). **Design Weakness 1**: Confusing copy.
- Fortunately, there was a place to enter the RMA number they gave me earlier to return the product, but that meant some shuffling through my digital papers to find the RMA number. **Design Weakness 2**: Forcing the user to use memory instead of recognition. It’s not like Garmin doesn’t know my RMA numbers, and it’s not like there is a security concern with listing them with a date and letting me choose: I’m logged into my Garmin account, having given them my user name and password.
- After submitting the RMA number, I was prompted to re-enter my login credentials. **Design Weakness 3**: Forcing the user to repeatedly enter the same data. Either that, or the flaw was timing out my session after only the few minutes it took me to retrieve the RMA number, time they should have allowed since they’re the ones that required the number.
- Then I was again prompted for the RMA number. **Design Weakness 4**: See Design Weakness 3.
- Then I got an error message, saying the RMA number should be digits only and no letters, even though I entered exactly the alphanumeric RMA number that Garmin itself had supplied me. **Design Weakness 5**: Intolerant entry formats. **Design Weakness 6**: Inconsistent formats used throughout the system.
- I stripped the letters from the RMA number and re-entered it, and finally got a page saying the device had been repaired and shipped Thursday. Now, a high quality site would have simply provided that information when I clicked “Repairs and Exchanges” back in the beginning: they could’ve listed the status of the most recent repair orders, which I’d bet would fulfill the goals of the vast majority of the users. **Design Weakness 7**: Requiring unnecessary steps.

All the design weaknesses above are usability weaknesses, and in aggregate they indicate a low quality web site. However, there was also a usefulness design weakness in that they didn’t actually provide me with the information I really needed. Certainly it is somewhat useful to know that they received and repaired the device and shipped it back to me, but what I really needed to know at that point was *when will I get it back*. They had neglected to say how it was shipped or to provide me with a tracking number. If they shipped it by first class mail, I could see it in a couple of days. Shipping by UPS ground could mean a week. Shipping by continental drift….

Even if they had fixed all the usability problems in Design Weaknesses 1 through 7, I still would have had at best a mediocre user experience because of low usefulness. It doesn’t matter how easy you make it to get the content on your web site if it isn’t the right content. Failure to correctly determine the content your users need will result in low quality.

Enjoy your flight.

We can also see the importance of usefulness in the success of other websites despite usability problems. Craigslist, eBay, Flickr, Wikipedia, and YouTube all have or have had usability and other design problems. Yet each dominates its respective domain by being more useful than the competition. This wasn’t so much due to deliberate design as an artifact of having the most users. With user-supplied content, the network effect increases a site’s usefulness. If you’re looking to buy or sell something, it’s most useful to go to the site with the most sellers and buyers. If you want photos or videos, it’s most useful to look where there are the most photos and videos. The success of these sites doesn’t mean that usability or other dimensions of the user experience don’t matter. It merely means that users will suffer through something of questionable design if the product or service provides something sufficiently useful.

Related to usefulness and usability is versatility, the capacity of the product to remain useful in a variety of environmental or user conditions. A Gore-Tex jacket, for example, that keeps the wearer comfortable in wet and dry conditions is more versatile, and therefore better quality. A chair with adjustments to best support a range of sitter physiques is more versatile, and therefore better quality (provided the adjustments are usable so more than just high-end sitters take advantage of them). High-quality web sites and applications maintain their usefulness for a range of users and environments too. High quality mobile apps choose graphic codes that are still distinguishable on devices used in bright sunlight, for example. High-quality photo manipulation software has features useful to both amateur and professional photographers.

Desirability also represents the benefits a product provides, but while usefulness comprises the extrinsic benefits, desirability comprises the intrinsic benefits. To be useful, a product has to be useful *for* something else. That is, its function is to be used to achieve some goal or state in the world outside of the product. An air conditioner functions to chill and dehumidify the ambient air. An MP3 player functions to play music. Presumably achieving these states is emotionally fulfilling to the users, which is why they use the product.

In contrast, to be desirable, as defined here, the mere use or ownership of the product is directly emotionally fulfilling, regardless of any state of the external world. This includes the degree the product looks, sounds, and feels attractive, the degree it is fun to use, and the degree it reflects well symbolically and socially on the user (i.e., the degree it is a status symbol, the kind of thing that Mike Johnson hates).

Generally, usefulness is the degree the product fulfills its primary purpose, the function it was engineered to accomplish, while desirability is the degree the product fulfills its secondary, collateral functions. However, usefulness is not necessarily the primary design challenge in all products. For consumer commodity products, usefulness is a given, and desirability is the only way to increase benefits. For games and other forms of entertainment, the primary purpose is to directly elicit an emotional response from the user, and usefulness and desirability merge.

Desirability is handled primarily by the styling of the product, that is, by its appearance: how it looks, sounds, and feels; the degree it is attractive or tasteful. This extends beyond the appearance of the product itself to anything associated with the product: its packaging, advertising, point of sale, even the people seen using it.

Being an emotional response, desirability is subjective, which implies that what is high quality to one person may be low quality to another. However, I think there are some characteristics of aesthetic design that mark high quality:

*Well-formed*. High quality products follow the classic rules of attractiveness without being cliché. Colors are chosen that look good together. Shapes are simple and elegant. Balance is appropriately applied. Alignment follows a neat grid. Height-to-width ratios are reasonable. Form is not distorted to shock, disorient, or depress the user. A quality product implies a *positive* user experience, which implies the aesthetics should primarily elicit positive emotions. We often associate quality with a quiet, elegant visual style. A calming response is a positive emotion, so this approach is a proven winner. However, it is not strictly necessary to be quiet to be high quality. High quality can be bold. Excitement and passion are positive emotions too. The trick is to be bold without being gaudy.

*Integrated*. Aesthetic elements should be built into the product, not tacked on as an afterthought. They should be aesthetically consistent with each other. A poor quality web site, for example, will be peppered with clip art of various styles. A high quality site will show aesthetic integrity. A poor quality app will slap on badges, puzzles, cute graphics, and other gimmicks to try to be fun. A high quality app will richly represent the domain, being fun by emphasizing what is fun about the domain. A quality product should not be an aesthetic experiment, a creation the designer glued up from disparate parts and threw to the consumers to see what happens. While that may win design awards, for quality products, all the experimenting should be done before it’s unleashed on the public. The designer should know exactly what impact to expect.

*Purposeful*. The choices in aesthetic design should be made for a reason: the aesthetic elements must serve a purpose. The purpose may be to enhance usability or usefulness, but it can also be purely aesthetic; the choice may be used to communicate something on an emotional level. For example, you can make a control stand out by using bold print or a bright red font. From a usability perspective, both can work. However, each provides a different aesthetic experience due to the different associations users have with boldness and redness. In a high quality product, you deliberately choose the kind of experience to provide. Do you want to give your users a newspaper-headline experience or a panic-button experience? In a poor quality app, aesthetic elements are chosen haphazardly with little thought other than that it looks nice. Designers blindly follow fashion, or apply a new technology just because it’s new, or because that’s the only thing the designer knows how to do.

*Honest*. The aesthetic design should be consistent with the other dimensions of the user experience. High-quality mountain climbing carabiners are usable, useful, and reliable and should also *appear* usable, useful, and reliable: The gate should be visibly knurled assuring ease of operation with cold or gloved fingers. It should feel light but rigid in the hand consistent with taking life-depending loads while not weighing the user down. It should snap unambiguously shut indicating a high probability of working in remote regions.

When products cannot meet a certain dimension of quality in some way, they should appear to not meet it: a delicate china cup should look and feel delicate. Likewise for high quality apps and web sites. If your interface looks simple, then the UI better be simple to operate. If your app is powerful, then it should look it.

A reliable product is one with a high probability of working in operational situations. An app that crashes or exhibits bugs is low quality. A web site with some incorrect information or some broken links is low quality (in this way, credibility is rolled into reliability). High quality apps and sites work. The cost of low reliability to the user is greater than the sum of the individual problems it creates. A low probability of working harms the user experience even when everything goes well. A history of low reliability can instill dread in the user, a fear that a failure may occur at any time, creating a negative emotional experience. Even when the low reliability is predictable, the experience is compromised. Usability is reduced as users have to perform work-arounds to avoid failures, resulting in greater effort, time, or a compromised outcome. They may even avoid using the product at all, reducing usefulness too. How useful is a luxury car, no matter how good the ride, if it never leaves the garage because it might break?

Reliability brings us back to the original intent of the industrial engineering definition of quality, namely, being without defects. A defect in this case is not merely failing to match a spec sheet or requirement, but an aspect of a specific product that keeps it from performing its useful function at a consistently high level. It doesn’t matter what’s on the spec sheet really. What matters is what actually happens in the operational environment and whether the product continues to perform. A suitcase whose handle breaks off with normal use is low quality regardless of the design load given in the requirements. Likewise, an app that spits out useless results under certain normal conditions is low quality no matter how much the code is proven “correct.”

But I’m defining reliability, and by extension, quality, to be broader than just an absence of defects. High quality products have not only a high probability of working when new, but also of continuing to work over many operations and a long period of time: the product should be durable. A quality laundry basket should last a long time before wearing out and breaking, for example. The useful performance of a quality database app should not deteriorate as records are added. The content of a quality web site should be regularly reviewed and updated to maintain accuracy. A quality product consistently works now and in the future.

I’m also expanding the definition of reliability to include more than reliable usefulness. In this case, by “works,” I mean the product maintains all three of the other dimensions of the user experience: usability and desirability, as well as usefulness. In the case of usability, consider Design Weakness 3 in the Garmin example above. I’m not sure why I was prompted to log in again, but I’m guessing it doesn’t happen to everyone all the time. It was a case of episodic usability failure; the site exhibited unreliable usability.

Desirability likewise can be reliable or not, especially when one considers durability. A low quality wristwatch will acquire a scratched crystal and chipped paint with use and age, detracting from the aesthetic pleasure it provides. A high quality watch maintains its attractive appearance, or, at least, acquires an attractive patina. Likewise, a high quality web site will have a style that doesn’t get old, and doesn’t go out of fashion.

There is a distinction between true quality and apparent quality, the latter being the quality of a product as perceived by the user. Such perceptions may be based on a superficial assessment of the known attributes of the product. For user experience, true quality is more important than apparent quality. Users’ true experience matters more to them than the experience they think they had. That makes more sense than it sounds. Consider an extreme case of a low quality carabiner that’s perceived as high quality. Sure, the rock climber feels mighty happy hanging it on his harness as he approaches El Capitan. But when the carabiner fails spectacularly 400 feet up, one can rest assured that the climber’s total experience for the day is profoundly negative, granted the negative part is also relatively brief.

This is the issue I have with credibility as the term is applied to web sites. The term means the site is *perceived* to have reliable information. Credibility is an important benefit to the site owner, but not necessarily to the user. What’s going to maximize a positive user experience is for the site to *have* reliable information. That’s going to help the users make the best decisions and plans that produce the best total experience for them. It’s about being trust*worthy*, not *trusted*. If your site has reliable information, then you certainly should make it look like it has reliable information; that’s part of honest desirability. But a site that steers users the wrong way but seems right is worse than a site that acknowledges the unreliability of its information.

I’ve already mentioned that consumers tend to assume that higher priced products are higher quality because high quality products do indeed tend to cost more to produce, for reasons I’m still, at length, getting to. Material composition is another attribute that consumers use to judge quality. High quality materials are often necessary to make a high quality product. Gore-Tex makes a motorcycle riding suit wind and water resistant while still being breathable, unlike the urethane-coated nylon suits sold before the advent of Gore-Tex. Full-grain leather is unmatched for abrasion protection, plus it’s comfortable and durable, so that is an alternative sign of a high quality motorcycle suit. Likewise, you would not expect a high quality app to emerge from a buggy, limited, and all-around low quality API library.

However, given that quality actually means user experience, quality materials are only a means to an end. For true high quality, “quality” materials are only relevant if they provide an enhanced user experience. Quality is enhanced *any* way you can make a motorcycle suit wind, water, and abrasion resistant, breathable, and durable, no matter what materials or manufacturing methods you use. Marketers may attempt to boost apparent quality by referring to Corinthian leather seats, whatever the hell that is, but the function of a seat is to provide a comfortable place to sit. What really matters is how it feels. Is it properly shaped? Is it too hot? Too cold? Does it stick to you? Does it look and feel good? How long will it last?

Apparent quality can also be influenced by perceived engineering trade-offs. From experience, consumers notice that excellence in one dimension of user experience tends to imply a deficit in another. Something that performs highly on an aspect of usefulness may be expected to be difficult to operate, lack versatility, require finicky maintenance, have an unpleasing utilitarian appearance, or be prone to break down. Sometimes this is true, because engineering trade-offs are real. Think race cars. Think of the F-117 “stealth fighter,” a warplane with proven high usefulness but so ugly it doesn’t so much fly as get repelled by the ground in disgust. The F-117 also requires a massive crew of specially trained technicians to maintain it and keep it useful.

But for most consumer products, high performance on all dimensions of quality is achievable, yet some people still apply the same rule of thumb. They assume that high usability implies a dumbed-down, low-capability product of questionable usefulness. Or they believe there is a zero-sum trade-off between an exciting, attractive appearance and usability. In the latter half of the 1980s, IBM successfully marketed its crude PC against the revolutionary Apple Macintosh because consumers believed that something as nice-looking and easy-to-use as the Mac couldn’t possibly be a useful machine, when in fact the Mac was more useful. In actual design, what’s usable and useful for novices is usually also usable or useful for experts, or at least the two are not antagonistic. Usually beautiful form evolves from excellent function. Think SR-71 Blackbird, rather than F-117.

Whatever the trade-offs, except for some special-purpose niche products, the best user experience and thus the best quality is achieved by balancing all the dimensions. Ultimately if a product gives up too much on some dimensions for the sake of others, it’s no longer high quality at all.

If you seek to produce high-quality products on all the dimensions of the user experience, there is one trade-off that is hard to avoid, and that’s price. It’s not impossible to have something cheap that’s also high quality, and certainly there are many expensive things that are in fact mediocre or even low quality. However, on average, it takes resources and time to make something that is highly usable, useful, desirable, *and* reliable. That means high development and production costs.

Achieving high quality is simple. Average-quality products get just average attention to performance. To achieve high quality, you have to do more. You attend to details and accept only excellence on all dimensions of the user experience. You don’t release the product until analysis, inspection, and testing confirm it.

It means taking the time and effort to study, design, and develop cascade menus which respond well to diagonal mouse motion so it’s slightly easier to pick a menu item with a single smooth slew. It means you fuss with the parameters for response to finger swipes until the scrolling and “bounce” look and feel natural. It means working out how to include side tone in your cell phone so users feel like they’re having a normal conversation, not shouting at a tinny little device. You check and edit and recheck your web site’s content for alignment, visual consistency, smooth transitions, legibility, spelling, grammar, clarity, and correctness. It’s the kind of work that has given Apple its reputation for quality products.

An ordinary user experience is relatively easy to achieve: everyone knows how to do it, and because there is a large market for it, there are plenty of resources available to get you there, so it doesn’t take too much work. For high quality, you make the extra effort to polish the product, to get it beyond the ordinary. All this extra effort costs more, and the incremental cost increases the further you get from the ordinary, the more you get beyond what’s widely known and easily available. It means spending $10 per production unit to make a $100 product just 1% better than a $90 product, but you do it anyway.

Because of the potential for high costs and therefore high price, high quality products don’t make sense for every business. Perhaps it makes more sense for you to pursue high value for your users, which is simply quality divided by price (this is different from the kind of value for the client that Morville is talking about). If you can find a way to deliver ordinary quality at a lower price, you’re also doing your users a service, and you should find lots of buyers. Everyone likes a good value, but not everyone is willing to pay the price for high quality; there are other things they want to spend their money on. Before you commit yourself to high-quality products, look into what you can achieve for the price your users can pay.

**Potential Solution**: Design and build to create the most positive total user experience. Apply additional time, effort, and money to achieve extraordinary levels of:

- Usability: the ease of use of the product.
- Usefulness: the capability of the product to fulfill extrinsic user needs and goals.
- Desirability: the intrinsic attractiveness of the product.
- Reliability: the chance the product remains consistently usable, useful, and desirable now and in the future.

A continuation of *Putting the G in your GUI*.

The primary reason for including colors in your UI can be understood by the logical synthesis of two basic principles:

- Colors are pretty.
- Users like pretty.

But what about the usability side of UX? Can colors be used for more than aesthetic reasons? Do they improve user performance of a task? Well, usually not. Various studies going back to the old character-cell displays, when color was first introduced, generally fail to find an advantage for color. This led me in the days of DOS to purchase a monochrome amber monitor for my first computer, while my coworkers were investing in more expensive sixteen-color VGA units. Yeah, nice blue background you got there for WordPerfect, dude, but does underlined text actually appear underlined? Does bold look bold? Hmph!

It’s pretty much the same for the web. Despite an increase in the number of colors we can display by several orders of magnitude, the way colors are typically used does little to improve performance. Coloring your menu items, headers, pictures, and different sections of content generally isn’t going to make content easier to find and see than a black-and-white version of the same. Usually you can group, emphasize, and distinguish content just as well with spacing, lines, and font size, style, and weight as with color. You need to color your links so users will recognize them as links, of course, but only because we’ve so successfully set that expectation. If the web from the start had used a non-hue convention to distinguish links (like an icon or a gray background), I expect performance would be the same, maybe even better. Users often think that color improves their performance, but it’s all in their minds, you know: a halo effect of the more pleasant experience that color brings. In most cases, designers are concerned with not *degrading* user performance when introducing color for aesthetic reasons, rather than with enhancing performance with color.

However, there is one area where color has the potential of enhancing user performance, and that’s the use of color coding, where color is a graphic dimension used to represent an attribute value of a data object. By color, in this case, I specifically mean hue, but saturation and brightness are also typically varied in color codes, especially the latter for accessibility reasons. Color makes things look distinct. Of all the graphic dimensions, only position and maybe shape rival color in making things look different for most practical purposes. This makes color coding helpful for the following user tasks, ranked in order of suitability:

- Pattern recognition.
- Visual search.
- Value identification.

Often the same display needs to support all three tasks.

To see how color can help with each of these tasks, consider the following symbol set for the icons or object controls used to display fanciful naval assets on a tactical display. We’re using shape coding to represent the class of the object (type of naval asset, in this case), which is often the most sensible thing to do.

| Symbol | Class | Abbrev. |
|--------|-------|---------|
| (image) | Main Echelon for Naval Engagement | MENE |
| (image) | Ballistic Operations Naval Combat Regiment | BONCR |
| (image) | Tactical Unit for Reconnaissance and Combat | TURC |
| (image) | Combat Long-range Operations Unit – Naval | CLOUN |
| (image) | Ground-support Levitating Unmanned Vehicle | GLUV |
| (image) | Submarine | SUB |

With shape used to represent the class of the asset, we’ll use color to represent its current function:

| Color | Name | Function |
|-------|------|----------|
| (swatch) | Red | Offense |
| (swatch) | White | Force Protection |
| (swatch) | Magenta | Intelligence |
| (swatch) | Blue | Command and Control |
| (swatch) | Yellow | Special Forces |
| (swatch) | Green | Logistics |

The example display below lists the assets, with the icons using both the shape and color codes. The table is sorted by class of asset, and then by identifier.

Pattern recognition is the spontaneous perception of relations between attributes. In the window above, for example, a casual scan of the table reveals that command and control is primarily the responsibility of the MENEs and vice versa. A solid block of blue icons in the “MENE” part of the table, and a lack of blue elsewhere, make this apparent. Color creates a strong sense of grouping, which emphasizes such patterns more than other graphic attributes do. We could have tried representing the assets’ functions with a different graphic attribute than color. For accessibility reasons, the asset functions are redundantly represented by the shape of the outline of each icon, but this just isn’t as good as color. It’s unlikely the pattern would jump out as well if color weren’t also used.

Of course, with a table layout, the user could sort the data on Function then scan the Class attribute and see that nearly all command and control is by MENEs, but the user has to think to do that. With color coding, the pattern is detectable without the user interactively looking for it. Besides, sometimes sorting isn’t an option, such as in this graphic layout below, a map showing the locations of the assets out at sea.

Here we see another pattern emerging: the command and control assets are all grouped in the central area surrounded by a ring of force protection units; offensive units tend to be more to the east. The separating effects of color are so strong, they in effect make overlapping data appear to have layers, with all things of the same color having their own layer. This is especially helpful when users need to attend to one pattern among overlapping patterns. Users can mentally filter out all but the color of interest to see the relations.

Color coding improves user performance on pattern recognition as long as meaningful patterns are associated with specific values coded by the colors. However, it can work against the users when the patterns depend on something else. For example, it is probably less apparent that the BONCRs, whatever their function, are clustered in the upper right. The eye tends to see the different-colored BONCRs as separate even though their icons are the same shape.

Sometimes the users can become so focused on the pattern among objects of one color that they fail to see its relation to objects of another color even when they’re looking right at it. They also may come to over-rely on color and fail to notice other information, such as text fields, that may qualify the interpretation of the color code. For example, a glance at the map may imply a pretty uniform distribution of logistics assets, but closer inspection of the lower left shows logistics being handled mostly by CLOUNs. I don’t mean to insult anyone, but those CLOUNs have no business doing logistics. Their transports are too small, for one thing.

The separation tendency of color limits how much you can use it. For pattern recognition purposes, you generally have to limit color coding to a single attribute. It’s probably not effective, for example, to color code the asset’s function in one part of the object image, and color code its readiness in another part. This can make the object no longer appear as a cohesive whole. Furthermore, users will tend to see the same colors for different attributes as representing the same thing, disrupting the users’ ability to see patterns in a single attribute.

Put it together, and it’s apparent that with the great power of color comes great responsibility. You probably should only use color coding for an attribute whose values are reliably known and are highly task-relevant.

Looking at the map above, find the asset responsible for Special Forces. You probably found it right away, likely faster than if I were to ask you to find the GLUV (there’s only one, not a pair), even though I made an effort to make each icon visually distinct. You’d also probably be faster and more accurate at counting the number of assets responsible for intelligence versus counting the number of TURCs.

Visual search tasks like looking for Special Forces and intelligence assets leverage the capacity of colors to visually separate objects, just like pattern recognition. However, visual search is a different task from pattern recognition. With pattern recognition, the user does a general scan and notices that certain objects in certain positions tend to have certain relative attribute values. In visual search, the user is looking for a specific attribute value.

Not appreciating the difference between pattern recognition and visual search can lead to a poor use of color: relying on *changes* in color to be meaningful to users. Users can easily scan a display and find a specific color, but unless they happen to be looking at just the right time in the right place, they will tend to miss something changing color. This is especially the case if the color the object assumes is not unique on the display; it’s not enough to constitute a major change in the general pattern. For example, if a couple BONCRs shift from a logistics function to offensive, don’t count on the user noticing right away. There are lots of red offensive assets, and a couple more are not going to pop out.

If a change in an attribute value requires prompt user response, then instead use some sort of *repeating* change or motion (time coding), like blinking or other animation. Humans are great at detecting cyclic variations in brightness in the peripheral vision, so such apparent motion will tend to get attention even at locations the user isn’t attending to. Human color discrimination, in contrast, is poor in the periphery, so hue changes alone are not so good even if they’re cycling. Of course, constant motion can draw *too* much attention, becoming a distraction, so you want to provide users with the ability to acknowledge and suppress the motion. At that point, the user knows the change has occurred, so there is no need for continued motion. Instead, however, the user may need to visually re-acquire the changed object, but now the task is visual search, which color coding is good for.

What is the function of the GLUV (in the center-right of the map above)? This is value identification, where the user is looking at an object image and must determine the value of an attribute, a task color can be good for because of the relatively large number of colors we can recognize and distinguish. Text is generally the best for value identification: if users need to tell the function of the GLUV, let them just read “Offense” from a Function text box. However, adding text attributes to your object image can be cluttering, and if you have graphic representations for other reasons, they may already be sufficient for value identification.

Among graphic dimensions, shape is generally best for identifying categorical data: users can recognize a remarkable number of different shapes, especially if the shapes have some sort of visual association with the concept they represent (e.g., the use of a stylized periscope to represent a submarine in the symbol set we’re using here). However, if you’re already using shape for one attribute (e.g., object class), color is a good choice for a second attribute (e.g., function).

Consider the following graphic pane displaying seawater salinity represented with shades of blue and green, which is relevant for controlling the depth of submarines:

And consider these two questions:

- How is a salinity of 3.4 versus 3.8 distributed over the sea?
- How are high versus low salinity distributed over the sea?

These are both pattern-recognition tasks, but the first concerns the *absolute* values of the attribute represented by color, while the second concerns the *relative* values. Visual search likewise can be for absolute or relative values:

- Where is salinity of 3.2 found?
- Where are there “holes” in the salinity?

And value identification can also be absolute or relative:

- What is the salinity of the water where the submarine is now?
- Is the salinity more or less than that found immediately south of the submarine?

Graphic dimensions in general often are best for tasks concerning relative values. Humans can distinguish minute differences in size, orientation, shape, and color as long as the different representations are right next to each other. You can represent thousands of different values and your users will see the differences.

Graphic dimensions in general and color in particular are generally much less effective for tasks concerning absolute values. Position, size, and orientation may be used to read fairly precise absolute values for numeric or ordinal data, but only if you include an integrated reference (e.g., labeled gradation lines or a coordinate grid) so the task is really about relative comparisons. Otherwise you probably wouldn’t want to try to represent more than three or four values with these graphic dimensions if the user needs to identify the values reliably. You can have small, medium, and large, you can have horizontal, vertical, and diagonal, but any more is pushing it. Likewise with density coding. Crispness and the time dimensions? Probably best to keep each of them to binary attributes.

For reading absolute values without a reference, color is better than all other graphic dimensions except shape, but the number of color codes should still be far fewer than you can effectively have for reading relative values. A user’s ability to recognize a precise color when it appears in isolation is limited because the “same” color doesn’t always look the same.

- Despite standards, different displays (e.g., monitors, mobile devices, printers) render colors differently from one another due to differences in their design. What looks like amber on one display may look more like light brown on another. As your users move among displays, the potential to confuse colors grows. The extreme case is when they print your beautiful multi-colored data on a black-and-white printer.
- Even the same display will show different absolute colors under different conditions. Brightness settings, ambient light level and spectrum, display age, and viewing angle (for LCD displays) all affect the appearances of the colors. Is there a chance your users may be wearing sunglasses? What tint?
- Different users see colors differently. There is a fair minority of people with various forms of color perception deficiencies, which limits the differences they see among colors. A small minority of people are completely color blind, seeing only shades of brightness.
- Colors make colors look different. There are contrast effects, where one color next to another changes how it looks. For example, looking at the salinity display above, it is not apparent that the “knoll” in salinity in the extreme upper left actually is the exact same salinity level as the “hole” in the lower right.

Put it together, and you should have no more than six color codes for reading absolute values.

Yes, so few colors limit the precision of the display. Where once we represented the continuous range of salinity, now we’re only showing discrete ranges. However, your users are already limited in the precision with which they can identify absolute values. Showing greater precision is not going to change that.

Even then, you should generally include redundant non-color representations for accessibility and printing reasons. In the re-done salinity display above, for example, there is redundant shade coding in addition to hue, so that higher salinity levels appear brighter as well as greener. If your chosen colors represent some sort of scale, then the brightness values should be consistent with that scale. For example, if you have green-yellow-red as a scale of safety, make sure your green is lighter than your yellow (hint: try amber instead). Below I describe the formulas for calculating and comparing the brightness of colors.

Limiting the number of color codes does not eliminate the contrast effect. In the display above, the upper-left knoll looks lighter than the lower-right hole. But with only five colors to choose from, users will likely recognize them as representing the same salinity. In an actual implementation, I might also show the salinity values of each region as text, perhaps on mouse-over to minimize clutter if other things are shown on the map.

You may be able to get away with more than six color codes if you provide a legend or key, like I’ve included in the salinity examples above. This is the color equivalent of providing reference marks for size, orientation, and position coding, changing an absolute-value task to a relative-value task. However, unlike a coordinate grid for position coding, the legend is not integrated with the colors on the display, so it doesn’t allow direct comparisons. For all graphic dimensions, the further the separation in time and space, the harder it is to make relative comparisons. A key won’t help with color contrast issues, for example.

Having decided on the number of color codes to have, you now have to decide which colors to use. Typical digital displays can render over 16 million colors, but there are several considerations, some mutually contradictory, that constrain your choices.

Obviously, your colors need to be distinguishable from each other. You want to avoid the difficulty MS Office 2007 presents its users: too little color difference between the active and inactive title bars of its windows.

I don’t know about you, but unless I have two windows side by side, I can’t tell at a glance if a window is active or not.

Fortunately, you can estimate the perceptual difference between any two colors by converting each color’s RGB values to Luv coordinates, a color space where the Euclidean distances between colors are proportional to human perceived differences.

Follow these five easy steps, which, as an example, we’ll perform on a lovely shade of carnation pink (#FF99CC, or 255, 153, 204):

1. Convert your RGB values into “rgb” values using the formula x = (X/255)^2.2

r = (255/255)^2.2 = 1.0000

g = (153/255)^2.2 = 0.3250

b = (204/255)^2.2 = 0.6121

2. Matrix-multiply your rgb values by a constant matrix to convert them to XYZ color coordinates. As three formulas (rounded to four decimal places), it’s:

X = 0.4124*r + 0.3576*g + 0.1805*b

Y = 0.2126*r + 0.7151*g + 0.0721*b

Z = 0.0193*r + 0.1192*g + 0.9505*b

The Y dimension, incidentally, is the relative brightness value, which will come in handy later.

So carnation pink is:

X = 0.4124*1.000 + 0.3576*0.3250 + 0.1805*0.6121 = 0.6391

Y = 0.2126*1.000 + 0.7151*0.3250 + 0.0721*0.6121 = 0.4892

Z = 0.0193*1.000 + 0.1192*0.3250 + 0.9505*0.6121 = 0.6399

3. Calculate L, the gray-scale brightness level:

IF Y/0.9999 > 216/24389, L = (Y/0.9999)^(1/3) * 116 – 16

ELSE, L = 24389 * (Y/0.9999) / 27

For carnation pink, 0.4892/0.9999 is more than 216/24389, so:

L = (0.4892/0.9999)^(1/3) * 116 – 16 = 75.41

4. Calculate u’ and v’ for the two hue dimensions:

u’ = 4 * X / (X + 15*Y + 3*Z)

v’ = 9 * Y / (X + 15*Y + 3*Z)

If X, Y, and Z are all 0, then set u’ and v’ to 0. For carnation pink:

u’ = 4 * 0.6391 / (0.6391 + 15 * 0.4892 + 3 * 0.6399) = 0.2583

v’ = 9 * 0.4892 / (0.6391 + 15 * 0.4892 + 3 * 0.6399) = 0.4449

5. Convert u’ and v’ into u and v coordinates.

u = 13 * L * (u’ – 0.1978)

v = 13 * L * (v’ – 0.4683)

For carnation pink:

u = 13 * 75.41 * (0.2583 – 0.1978) = 59.25

v = 13 * 75.41 * (0.4449 – 0.4683) = -22.97

So carnation pink (#FF99CC) has Luv coordinates of 75.41, 59.25, and -22.97 (within rounding error). That’s only an approximation, of course, assuming a “standard” luminous digital display. Individual displays vary a little from the standard, and these numbers mean nothing if you’re dealing with a printer. But what do you expect from such a simple calculation?
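The five steps above can be collected into one short function. Here’s a minimal Python sketch of my own (not the author’s spreadsheet), using the same rounded matrix constants and white point as the worked example; small differences from the hand-calculated 75.41, 59.25, -22.97 come from intermediate rounding in the text:

```python
def rgb_to_luv(r8, g8, b8):
    """Approximate Luv coordinates of an 8-bit RGB color, per the five steps above."""
    # Step 1: gamma-expand the 0-255 values to linear rgb
    r, g, b = ((x / 255) ** 2.2 for x in (r8, g8, b8))
    # Step 2: matrix-multiply to XYZ (constants rounded to four decimal places)
    X = 0.4124 * r + 0.3576 * g + 0.1805 * b
    Y = 0.2126 * r + 0.7151 * g + 0.0721 * b  # Y is the relative brightness
    Z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    # Step 3: gray-scale brightness L (0.9999 is the Y of pure white here)
    y = Y / 0.9999
    L = 116 * y ** (1 / 3) - 16 if y > 216 / 24389 else 24389 * y / 27
    # Step 4: the two hue dimensions u' and v' (0 if X, Y, and Z are all 0)
    d = X + 15 * Y + 3 * Z
    up = 4 * X / d if d else 0.0
    vp = 9 * Y / d if d else 0.0
    # Step 5: shift by the white point's u' and v', and scale by L
    return L, 13 * L * (up - 0.1978), 13 * L * (vp - 0.4683)

L, u, v = rgb_to_luv(255, 153, 204)  # carnation pink
print(round(L, 2), round(u, 2), round(v, 2))  # ≈ 75.4 59.33 -22.96
```

As a sanity check, black comes out as (0, 0, 0) and white as roughly (100, 0, 0), matching the black-to-white scale used below.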

To know how well carnation pink contrasts with, say, deep orchid (#9933CC, or 153, 51, 204), get the Luv coordinates for deep orchid (which are 43.36, 30.74, and -95.69) and calculate the three-dimensional distance separating them:

Color Difference = sqrt[ (L1 - L2)^2 + (u1- u2)^2 + (v1 - v2)^2 ]

Or:

Color Difference = sqrt[ (75.41 - 43.36)^2 + (59.25 - 30.74)^2 + (-22.97 - -95.69)^2 ] = 84.43

For comparison, the difference between black (#000000) and white (#FFFFFF) is 100, so you have quite good color contrast between carnation pink and deep orchid. You should also check the perceived brightness scale contrast, because in a lot of situations, brightness matters more in discriminating colors than hue (and I’m not just talking about users with color perception deficiencies). The two L values can be directly compared:

Brightness Difference = L1 – L2

Brightness Difference = 75.41 – 43.36 = 32.05

That’s barely adequate brightness contrast. However, if you want to see *really* bad contrast, run the calculations on Office’s title bars, whose background colors average #DFE7EC and #DAECF8. The color difference is only 9, and the brightness difference is essentially nil. No wonder I can’t recognize them.
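Both comparisons are easy to script. Here’s a small sketch, assuming the Luv coordinates have already been computed as in the worked example above:

```python
import math

# Luv coordinates from the worked example above
carnation_pink = (75.41, 59.25, -22.97)
deep_orchid = (43.36, 30.74, -95.69)

def color_difference(luv1, luv2):
    """Euclidean distance in Luv space; black-to-white is 100 on this scale."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(luv1, luv2)))

def brightness_difference(luv1, luv2):
    """Difference in the gray-scale brightness dimension L alone."""
    return luv1[0] - luv2[0]

print(round(color_difference(carnation_pink, deep_orchid), 2))       # 84.43
print(round(brightness_difference(carnation_pink, deep_orchid), 2))  # 32.05
```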

Using the above process you can go through potential color candidates to select a set of codes that are reasonably distinct from each other. Do yourself a favor and put the formulas above in a spreadsheet to save yourself some time. Or download mine, putting in your RGB values in the framed shaded cells (no warranty expressed or implied, user assumes all risk, do not use while operating heavy machinery).

However, you don’t have to go through a whole lot of calculations to realize that high saturation colors are going to be more distinct from each other than low saturation colors, so maxing out at least one of the RGB values to FF is a good starting point for a color. Also keep in mind the following:

- Yellow and white are hard to discriminate since the human eye is maximally sensitive to yellow light (they have L’s of 97 and 100, respectively), especially in low-contrast reading conditions (e.g., using a mobile device in sunlight). Avoid having both pure yellow and white as codes. Instead, you can make your white a pale blue, or make your yellow more of an orange or amber (like I did to code Special Forces), which will still contrast well with red, if you have that too.
- Very pale blue and white are also hard to tell apart, so avoid having them both too.
- Pure green (#00FF00) and pure cyan (#00FFFF) are another pair that can be easily confused at a glance. Give them a little extra space apart (e.g., #00FF00 and #00A0FF).
- Brown and orange also don’t play well with each other on the same digital display. You may want to use lighter shades of brown for good contrast on a dark background, but tan is just low-saturation orange; they have the same hue and brightness, making them much alike. If you have lots of tan objects on your display, orange is not a good choice for indicating an alarm state.

Of course you need your color-coded symbols or icons to stand out against your background so your users can see and read them. Luv coordinates can also be used to determine foreground-background contrast, but brightness contrast (as represented by L or Y) is more important for foreground/background distinction. Thus, you generally want to go with pure white (#FFFFFF) or pure black (#000000) for a background to have maximum contrast with whatever colors you have for your symbols in the foreground.

As comparing L values will show you, if you’re using high-saturation colors to make them maximally distinguishable from each other, you’re usually going to get more contrast with a black background than a white one. However, white backgrounds can hide glare better and dark text on large white backgrounds is usually easier to read than light text on large black backgrounds, so if you’re displaying a lot of text in addition to symbols, you may want to work with a white background. Trying to compromise and use a medium gray background is not going to work.

L and Luv comparisons will also show you that some of your colors stand out more than others against a given background. That can be a good thing if the colors that stand out best are the more important ones to see. If not, then consider adjusting your color values so this is the case. For example, red contrasts relatively poorly against a black background. For coding alert states, you might want to go with orange. If you must use red, maybe you should use a white background, or you could outline the alert symbol in white, or use a red-and-white pattern to make it contrast more against a black background.

If you want to color code the *background* in addition to objects in the foreground, you’ve got a real challenge. Now the contrast of your foreground objects is going to depend on the background they happen to be on. You need background colors that are distinguishable from each other but still contrast well enough with all your foreground colors. Good luck with that. It’s pretty much impossible to get high contrast among a large range (more than three) of foreground colors, a large range of background colors, and between each foreground and background color. You must prioritize your contrasts:

- Most importantly, have good brightness contrast between each foreground/background pair of colors.
- Secondly, achieve good color contrast among your foreground colors.
- Lastly, maximize the color contrasts of your backgrounds.

The brightness contrast ratio for all possible foreground/background combinations should be at least 3.0. We use Y to calculate the ratio:

Brightness Ratio = (Y1 + 0.05) / (Y2 + 0.05)

Where Y1 is the larger Y of the two colors you’re comparing.
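Here’s a quick sketch of that check, reusing the gamma and matrix constants from the Luv conversion steps earlier (the example colors are my own, not from the text):

```python
def relative_brightness(r8, g8, b8):
    """Relative brightness Y of an 8-bit RGB color (0 for black, ~1 for white)."""
    r, g, b = ((x / 255) ** 2.2 for x in (r8, g8, b8))
    return 0.2126 * r + 0.7151 * g + 0.0721 * b

def brightness_ratio(fg, bg):
    """Foreground/background brightness contrast ratio; aim for 3.0 or more."""
    y1, y2 = sorted((relative_brightness(*fg), relative_brightness(*bg)), reverse=True)
    return (y1 + 0.05) / (y2 + 0.05)

# Pure red symbols on black versus white backgrounds:
print(round(brightness_ratio((255, 0, 0), (0, 0, 0)), 2))        # ≈ 5.25
print(round(brightness_ratio((255, 0, 0), (255, 255, 255)), 2))  # ≈ 4.0
```

Both pairs clear the 3.0 floor on brightness alone; this ratio says nothing about hue contrast, so treat it as a floor, not the whole story.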

Background color contrast has lower priority than foregrounds because background codes tend to occupy relatively large areas on the screen, while foreground objects tend to be relatively small symbols or icons. Users can recognize and see differences in colors and brightness better when they occupy large areas. Save your vivid high-saturation colors for foreground objects. Use duller or paler colors for the background. This will not only optimize your user’s performance, it will also be aesthetically easier on the users’ eyes. Big areas of vivid colors will just drive them bonkers. If you need reference marks or other details in the background (e.g., graticule, borders), try to keep them to shades of gray so they won’t complicate the perception of the colored objects on top.

If you’re still having trouble getting good foreground-background contrast for all your color codes, try outlining your foreground objects with a high contrast color like white or black regardless of their background. That’ll help. In the example below that combines color codes for salinity with those for asset functions, I went a step further and surrounded the symbols with a small area of black.

Your color codes need to be internally consistent throughout your product, of course. That includes using the same mapping of colors to values between sessions; don’t re-scale the color codes because the current range of values has changed. You also want your colors to be consistent with your users’ stereotypes from prior experiences. This includes the colors used in similar products (e.g., if users are used to regarding gray to mean “disabled,” they may assume the same applies to your icons), and more general cultural meanings associated with various colors. Most people in western civilization associate blue with cold or water. Red means fire, stop, or danger. Be careful who you call dangerous. At the very least, you want to be sure your colors don’t strongly *contradict* user expectations. If Offense is your most “dangerous” function, then make it red. Red is probably not a good choice for Logistics. Your sea is better off blue than brown, although you could get away with a sea of green as long as you’re not also trying to represent botanically lush terrestrial regions.

Consistency with user color expectations is complicated by the fact that different users have different stereotypes depending on their experience. Different cultures, subcultures within a culture, and eras within a culture will have different meanings for different colors. Color associations may vary by occupation, subject domain, or user role. For example, lay people would likely associate darker shades of blue with deeper water than lighter shades. However, for sailors familiar with traditional nautical charts, the opposite is true: white represents the deepest water.

Except for some stereotypes, color codes are arbitrary, and learning color codes can constitute a substantial memory burden (which is another reason to have only a few color codes). This is especially the case for color coding numerical and ordinal data. Other than ranking green – yellow/orange – red as an ordinal scale of threat, people don’t have a natural ranking for colors. When orders are imposed on colors, they are not consistent across domains. Electrical engineers use a different order than karate dojos, for example. To someone at Microsoft, gray suggests greater risk than blue, but I don’t know about anyone else.

However, this lack of consistent ranking only applies if you attempt to use nearly the entire spectrum. If you keep your color codes to a blend between two or three adjacent primary and secondary colors on the spectrum, people will see an order to it (although they don’t necessarily know which end is the “high” end). You can use Luv coordinates to make sure your colors are appropriately spaced on the scale they represent.

This is what I did for representing salinity, blending it from blue to green. Going through three adjacent colors (e.g., yellow-green-blue) is suitable when the middle of the scale represents a neutral or zero value (rather than simply “medium”), and also makes for more color contrast among the codes.

Of course, limiting your colors to a segment of the spectrum means less color contrast among the codes than using the whole spectrum. However, if you also include changes in brightness, like I have done to make it easier for your users, it may be adequate for background color coding, once again relying on the large areas the colors occupy. For foreground objects, you may want to use a different graphic dimension to represent ordinal and numeric data.

The best way to provide accessibility to users with color perception deficiencies is through some redundant representation in addition to hue. This includes:

- Ensuring your color choices are also well separated in brightness (L), while still all contrasting well with the background.
- Additional graphic codes like shape or size, such as I’ve done with the shape of the symbol outlines.
- Text representations of the attributes.

Of course, presumably you chose color to represent an attribute precisely because of the unique advantages that color provides (e.g., for pattern recognition).

Thus, users who have to rely on the redundant representations will be at a disadvantage no matter what. For example, the brightness difference of the color codes I’ve chosen for these maps is as low as 6; that’s too small a difference to see. Furthermore, the brightness differences may imply a ranking that does not accurately characterize the attribute values. There’s not much you can do about this if you’re using a large number of color codes.

However, if you have only a few color codes (say, three), it’s possible to select colors (and levels of brightness) that ameliorate the situation for most users with color perception deficiencies. There are various tools on-line for testing your colors. A good place to start with the colors themselves is to stick to yellows and blues, since by far the more common form of deficiency concerns red-versus-green discriminations.

For example, eye-tracking heat maps should go blue-green (or gray)-yellow, rather than the green-yellow-red I usually see. For the most common form of color deficiency, the latter makes the two extreme values hardest to discriminate, precisely the opposite of what you want. The former preserves the “heat” metaphor (in Western cultural stereotypes, blue is cold, yellow is hot) and redundant brightness coding is relatively easy (of all the primary and secondary colors, high saturation blue has the lowest L value while high saturation yellow has the highest).

If you must use green and red, try the traffic-light trick and mix some blue with your green and some green with your red, like I’ve done with the symbols in this set. This will make greens look bluish and reds look brownish to those with red-green perception deficiencies. If you must use green-yellow-red (e.g., for ranking danger levels), use a pale bluish green and a darker red. This will appear like blue-yellow-black to those who can’t discriminate greens and reds, which is still reasonably close to our cultural stereotyped associations between color and danger. It also can provide redundant light-medium-dark brightness coding.

You want high-saturation colors for maximum contrast, and people will generally agree on the names for high-saturation red, yellow, green, and blue. However, other high-saturation colors, such as magenta and cyan, have no widely accepted names. Many people don’t know what “cyan” and “magenta” mean. Some might call the former “aqua” while others call it “light blue,” for example. The latter may be “hot pink,” “fuchsia,” or “bright purple.” If you are going to use these colors (or any hues other than red, yellow, green, and blue), be prepared for possible confusion in written or verbal communication (e.g., with tech support). Yet another reason to limit your number of color codes.

Due to the physical optical effects of lenses, including the ones in our eyes, different light frequencies require different degrees of focus. To focus on blue objects, for example, the lenses of your eyes need to focus “farther” than to focus on red objects. This means that to switch from looking at red objects to blue objects, users have to refocus their eyes, which can be tiring. You should avoid color sets that result in a lot of high-saturation red and blue objects if the task requires the user to shift focus repeatedly from one to the other.

Blue is hard to focus in general because of its short wavelength, a phenomenon that grows more acute as a user ages. High-saturation blue (#0000FF) should be avoided for anything that requires detailed study, such as text or subtly varying symbols or icons. Make it a darker or lighter blue (e.g., mix in some green) to avoid problems with this. The default color for unvisited links? Not really a good choice.

On the other hand, the difference in focus levels of different colors could be a phenomenon to exploit. Blue objects tend to appear farther away than red objects, an illusion called “chromostereopsis.” Stare closely at the symbols in the map below, and the subs will appear “deeper” than the GLUV.

Well, sometimes it works.

This may help users visually filter out blue objects and focus only on red objects or vice versa. It could also be used to suggest different physical levels. Used to be they’d color code elevation on maps, with the seas being conveniently blue, and therefore at the bottom, while the mountains were reddish, appearing on top. It didn’t work out too well because it also meant sea-level deserts, like on the Arabian peninsula, appeared a rich green, suggesting heavy vegetation. Also, chromostereopsis is only a strong effect with very high saturation colors. However, it might be worth trying to use chromostereopsis in certain special-purpose applications. Maybe submerged assets like submarines should be blue while flying assets like GLUVs should be red, for example. More broadly, mixing a little blue in your background (e.g., using a midnight blue or very pale blue) may subtly help your foreground objects appear more “on top” of it.

…But so many constraints and considerations. It’s difficult to come up with a color set that even minimally satisfies them all. It’s easy to just throw up one’s hands and let the users choose the colors they want, but that rarely works out. If you as a usability professional can’t figure out how to balance all these constraints, what chance does a layperson have? If you see a need to let users choose their colors, provide a limited palette for limited purposes (e.g., to pick the color to represent oneself). You can also provide a global color-contrast setting that adjusts brightness and color contrasts (consistent with Luv calculations) while keeping the hues fixed. Users can then adjust the colors to their liking for their current displays, viewing conditions, and visual abilities.

The effects of combinations of colors are not easily predictable, so once you have selected your colors for both foreground and background, you should test them as a complete system on your users, checking that they all work well together.

**Potential Solution**:

- Use color codes to aid pattern recognition, visual search, and possibly value identification.
- Reserve color coding for attributes whose values are highly task relevant and reliably known.
- Do not rely on color changes to capture users’ attention.
- Limit the number of codes to 6 if absolute values must be identified.
- Use Luv coordinates to achieve distinct colors and good foreground-background contrast.
- High-saturation colors will be most distinct from each other.
- Foreground and background should have a brightness contrast of at least 3.0.
- Color codes for large background areas may be closer to each other than color codes for small foreground objects.
- Use colors consistently within your app and with user expectations and experiences.
- Remember that color codes can be arbitrary and therefore hard to remember.
- Consider using part of the spectrum for representing numeric or ordinal data.
- Provide redundant representations of the data for accessibility.
- Favor blues and yellows for maximum accessibility.
- Consider colors that are easily named.
- Avoid having high saturation blue and red objects used in the same task.
- Avoid high-saturation blue for text or fine detail.
- Limit user ability to customize color codes.
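The brightness-contrast item above can be checked programmatically. This sketch uses the WCAG relative-luminance formula as a stand-in (an assumption on my part; a Luv-based calculation as mentioned earlier differs in detail but serves the same purpose):

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB color (channels 0-255)."""
    def linearize(c):
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """Luminance contrast ratio between foreground and background colors."""
    lighter = max(relative_luminance(fg), relative_luminance(bg))
    darker = min(relative_luminance(fg), relative_luminance(bg))
    return (lighter + 0.05) / (darker + 0.05)

# Black on white is the extreme case: a ratio of 21.
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))
```

Run each candidate foreground against each candidate background and flag any pair that falls below your contrast floor.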

The really valuable work is in serving the advertisers by providing them eyeballs and personal information. I realize now that all my efforts to maximize the user experience are misguided. Users are the product and advertisers are the consumers. Effective today I now dedicate myself and this site to advertiser-centered design.

As it is, I’m playing catch-up. The reason businesses pay for UX in the first place is in hope of promoting sales. We create experiences that encourage people to buy, which is what advertisement is all about. It’s time I recognize that UX is a component of advertising.

Sites have long had advertisements to finance themselves. These sites pay for advertisements on other sites to drive users to their sites, where they click the advertisements providing the financing for more advertising. Not a few of these advertisements advertise other sites also supported by advertisements. Click the right ads from site to site and surely you’ll end up back where you started.

Modern advertisers know how to get others to do their advertising for them. Search engine optimization turns Google search results into advertisements. Companies like Apple turn product launches into “special events” covered by the press, aiming to change the news into advertisement.

This is hardly new or limited to the web. We have long been buying hats, t-shirts, and other apparel that advertise products. We don’t wear clothes. We wear banner ads. Stores like Abercrombie and Fitch and Old Navy advertise apparel which advertises the stores. Today our entertainment isn’t just supported by advertisement. It *is* advertisement. Hollywood advertises movies that advertise other things through product placement. Cable networks advertise TV shows interlaced with advertisements through the use of “bugs” and “snipes.”

It’s pretty clear where the web and, by extension all of Western Civilization, is heading. Soon everything will be advertisements. Just as the agricultural economy gave way to the industrial economy, which yielded to the service economy, the service economy will fall to the advertising economy, where the majority of the population is involved in advertising. Our world will be filled with advertisements that advertise more advertisements people can experience, ad infinitum. It’s a utopia where everything will be free, paid for by advertisements. The economy will be distilled to its essence: pure selling.

I want a piece of that.

You may have noticed a subtle change to this web site: I’ve started including advertisements. At first I wasn’t sure where or how to advertise to attract advertisers, but then I realized that, since I’m a UX expert, I must also be an advertising expert. And as an advertising expert, it would be hypocritical if I didn’t acknowledge that the most important thing for me to advertise is me.

It turns out I could offer myself some very reasonable advertisement rates, which have nonetheless proven very lucrative for me. I’ve brought in thousands of dollars in revenue from myself on my first day. There’re expenditures to be counted still, of course, including a substantial advertising budget, but that is to be expected since advertisements are all I’ll be producing from now on. I’ll make profits through volume. You got a better business plan for today’s economy?

What can I say? We blew it. The web came along, and we human factors engineers were AWOL. Not that anyone tried to include us. I know I was home at the time, yet CERN never called me to say, “Well, Mike, we’re going to launch the Information Age, and we’re wondering if you had any advice.” Maybe they thought allowing for the human factor just wasn’t that hard, so they’d take care of it themselves. Typical. And to be honest, it wasn’t that hard. Making links appear bright blue seemed like a reasonable enough standard. In a time when web pages generally appeared as black text on a pale gray background, blue would stand out, suggesting liveliness and therefore interactivity, while avoiding the Western emotional associations of green, yellow, or red. Underlining links was icing on the cake, providing a redundant cue that aids accessibility. They were probably pretty proud they thought to do that.

Of course, if they had consulted a human factors expert, she or he would have pointed out that bright blue objects can be difficult to focus on, especially for older users, which is not so good for reading text. She or he would have also pointed out that underlining interferes with reading, owing mostly to masking of the descenders of lower-case letters.

I like to think we would have done better if we had been involved. Maybe we would’ve suggested background shading to indicate links, rather than colored font and underlining. This would maximize readability while still making links stand out and look active in an accessible manner, like highlighted text in a physical book. We could have allowed developers to specify the size of mouse-sensitive areas so that links with short labels could have large targets. Lists of links would appear as unbroken shaded rectangles, like sidebar menus often do today, eliminating the dilemma of choosing between cluttering underlining for link lists and no underlining cues for in-line links in order to achieve a consistent link appearance across the site.

We could have tested and developed a standard set of icons to put with each link text, like a cite or footnote number, that would provide a redundant cue of a link. These would distinguish links from other uses of background shading (which would be rare, much as it is today) and identify links when the presence of shading is ambiguous (e.g., when the entire page is links). But most importantly, the icons themselves would indicate:

- The location of the linked content, whether on the same page, the same site, or another site.
- The medium or format of the content (e.g., html, pdf, image, sound, or video).
- Whether the link opens a new window or not.
- Whether the content had been visited before or not (rather than using arbitrary color, a standard that has fallen into disuse since no one uses standard colors anymore).

With some minor tweaks to HTTP and HTML behavior, the icon could indicate whether the content has been updated since the last visit, which is often what users really care about. Maybe it could even indicate if a link is broken.

Tooltips would provide text representation of the icon’s information, as usual, but with standard icons, experienced users can extract this information with a quick visual sweep of the page. The tooltip would also display the size of the content (or estimated download time) and URL of the content (if no title tag is provided), rather than it appearing far down in the status bar where few users notice it. Animation in the icons would indicate if the browser is attempting to retrieve the content, so users can see the effect of their clicks right where they’re looking, rather than shifting attention to a distant and ambiguous throbber or status bar.

Oh, well, too late now.

Returning from dreamland, it seems that too many web sites of reality have enough trouble complying with the meager existing standards and guidelines of today. Sites respond too slowly. Trivial input format variations are rejected. Unnecessary navigation and other input are imposed. Lengthy work by the user is discarded without warning or alternatives. Links masquerade as ordinary text and vice versa. Information is shown in no meaningful order to the user.

These are obvious design flaws. Anyone with any UX background could predict that such things will interfere with usability, yet somehow they make it into production. My hunch is that whoever was in charge of the web site just wanted it that way (“it makes perfect sense to *me*”), and summarily dismissed others’ concerns as merely a difference of opinion or taste. A simple usability test may settle the issue, but you can’t test everything, and not every issue is truly testable by a small-sample usability test. Academic research results are great, but some issues don’t interest academic researchers. They want to build general theories of HCI or develop innovative interfaces. Testing something with obvious results doesn’t fit well with that.

That leaves citing authoritative standards and guidelines to bolster your argument against bad design. There are a number of such standards and guidelines you can cite, and they all have their uses, but also their limitations:

- Usability.gov represents the gold standard of web UI guidelines, with each guideline based on cited research. How this became the responsibility of the National Cancer Institute is beyond me, but I’m glad they did it. However, being research-based, they tend to be limited to issues of academic research interest.
- Lynch and Horton’s Yale Web Style Guide provides valuable guidelines, although it tends to focus on graphic design and typography.
- Ameritech Web Page User Interface Standards and Design Guidelines by Mark Detweiler and Richard Omanson has many recommendations that are as relevant today as they were back when most screens were 640 by 480 pixels. That was also back when there was a company called “Ameritech,” and copies of this document are hard to come by now.
- W3C Accessibility standards have done as much to improve the UIs for typical users as they have for those with disabilities. We all benefit from site characteristics like proper use of markup and redundant channels of communication. But, not surprisingly, these standards have limited scope for usability.
- Operating system guidelines, such as those for Apple, Microsoft, and Gnome, may be used. For example, they tell us we should label our buttons with what they do (e.g., “Register”), rather than with a general uninformative term (e.g., “Submit”). However many such guidelines were never intended to apply to the web world, and they’re easy to dismiss as not applicable.
- Guru books, like those by Krug and Nielsen, contain solid advice. However, some may dismiss them as just the opinion of someone who happens to have a high hourly consulting rate.

Perhaps less well-known are more general human factors standards that also may be applied to web site design. Various major organizations have such standards (e.g., ISO, NASA, FAA, ASTM) that provide a significant supplement to the above standards and guidelines. Among the most venerable standards documents is the US Department of Defense Design Criteria Standard: Human Engineering, known to its friends as MIL-STD 1472. First published around 1968, MIL-STD 1472 has served as the inspiration, if not the source, for many human factors standards to come, making it the literal mother of all HF standards. Originally a consolidation of older standards and updated semi-regularly ever since, MIL-STD 1472 represents the collective research and operational experience of decades of human-machine performance under the most demanding circumstances. There’s nothing quite like life-or-death situations to highlight usability problems.

Human factors as a distinct field came into its own during World War II, when the military was confronted with ordinary people having to operate the extraordinarily complicated and unfamiliar technologies of the time, like radar, multi-engine high-performance aircraft, bombsights, and artillery targeting computers. It was a situation analogous to the recent one with the web, with ordinary people using increasingly capable servers and networks. While there’s a huge gulf separating a 1940s analog electromechanical artillery computer from the modern gigahertz and terabyte web server, the human operators are in many ways still the same. Thus, many MIL-STD standards remain relevant.

MIL-STD 1472 is a standards document, not a best practices document. It specifies the minimal requirements, and as such will only catch gross usability flaws. One has to wonder about some of the standards, like 5.2.3.1.5.1 Graduations, which specifies that the graduations of analog gauges “shall progress by 1, 2, or 5 units or decimal multiples thereof.” I mean, did some manufacturer deliver a jeep with a speedometer marked at 0, 7, 14, and 21 mph? Who would ever think of designing something like that? Well, presumably someone did, and the results in a time of war were ugly, so they made a standard to prevent it from happening again. Basic standards like these are fine for our purpose of dismissing stupid designs that aren’t worth a usability test.

Of course, general standards like 1472 have their weaknesses too. But, along with other standards and guidelines, it’s another tool to have in your toolbox. MIL-STD 1472 comprises human factors standards for all kinds of products, from safety glasses to aircraft carriers. Standards scattered throughout the document can apply to your web site. For example, much of Section 5.5 Labeling applies as much to headings, fields, and link labels on a web site as they do to physical labels. However, Section 5.14 is dedicated specifically to computer UIs, so let’s focus there.

As might be expected of standards dating to the 1960s, much of Section 5.14 was intended for antiquated computer systems featuring simple menus and forms. Contemporary GUI features, like direct manipulation, multiple windows, and scrollbars, are not addressed in the latest version, 1472-F. So that makes it just about perfect for your average modern web site. Strip away the browser, and you’ll recognize that a website is not all that different from early interactive character-cell user interfaces seen in the 1970s. Users may use multiple windows or tabs, but web sites don’t. Web sites divide their UI into pages, while character-cell UIs divide them into screens. The typical web site is almost entirely a menu interface. Whether arranged in a neat sidebar or distributed in-line with content, each link is by most definitions a menu item. Occasionally a web site has a modal form to complete, where the user keys in some simple alphanumeric strings into fields and selects an Execute command when done. That’s pretty much what the military was dealing with 30-odd years ago.

Nonetheless, general standards need to be subject to interpretation when applying them to a specific technology like web sites. The text of the standards may have a legal tone to them, but for the purpose of achieving the best usability, they should be interpreted according to the spirit of the law, not necessarily the letter. Users don’t care if you’re *technically* complying with standards they’ve never heard of. They only care if the web site gets them the content they need.

If you’re new to reading standards, there’s a big difference between a standard that uses “shall” versus “should.” “Should” indicates a recommendation, something you as a designer can choose not to comply with if you have good reason. In my book, a “good reason” is research-backed analysis or usability test results that show that following a standard will result in inferior usability. You shouldn’t violate a “should” because of a hunch on user performance or to avoid inconveniencing the developers. “Shall” indicates a requirement. The design must follow “shalls” to be in compliance with the standard.

Here, then, are some 1472 standards that I think are most apt for web sites, along with my interpretation of them.

**5.14.3.2.2.2 Related data on same page.** When partitioning displays into multiple pages, functionally related data items shall be displayed together on one page.

This means that content shall be divided among pages by function, task, or similarity. Prose and forms shall not be arbitrarily divided into pages, as is done to articles at the web site of the Minneapolis Star-Tribune, among other on-line newspapers and magazines. The readers get to the end of a page and have to mentally cache the context while they click Next Page and wait several seconds for all the ads to load before they can proceed. And sometimes the only thing on the subsequent page is the post date and contact information for the writer.

**5.14.4.1.7 Hierarchical process.** The number of hierarchical levels used to control a process or sequence should be minimized. Display and input formats shall be similar within levels. The system shall indicate the current positions within the sequence at all times.

The first part tells you that you should generally go for a relatively broad and shallow IA hierarchy to minimize the time-consuming page-loads a user must endure. It’s a “should” rather than a “shall” because at some point too much content or too many links on one page will interfere with usability, so the standards give you an out. However, it does mean you should avoid pages with sparse content, for example, content that fills less than 700 vertical pixels. If two pieces of related content can be visible on the same page without scrolling, then they should be on the same page.

The above standard also covers the labeling of pages, which is elaborated by these standards:

**5.14.3.2.2.3 Page labeling.** In a multipage display, each page shall be labeled to show its relation to the others.

**5.14.3.1.12 Page numbering.** Each page of a multiple page display shall be labeled to identify the currently displayed page and the total number of pages, e.g., Page 2 of 5.

Collectively, these standards say that if the page is part of a linear or hierarchical navigation structure, then the pathway through the structure to the page shall be represented on the page (e.g., with breadcrumbs). They also say that you shall show progress through linear structures, like search results, wizards, and multi-page forms.

**5.14.3.2.5 Context for displayed data.** The user should not have to rely on memory to interpret new data; each data display should provide needed context, including recapitulating prior data from prior displays as necessary.

**5.14.4.1.5 Availability of information**. Information necessary to select or enter a specific control action shall be available to the user when selection of that control action is appropriate.

These standards mean that the information that is typically referenced to complete a task shall be available on a page supporting that task without the need to navigate to another page in the site. This is especially important for web forms where navigating away to try to get some key piece of information can mean wiping out the work the user had done so far.

**5.14.3.1.4 Order and sequences.** When data fields have a naturally occurring order (e.g., chronological or sequential), such order shall be reflected in the format organization of the fields.

Content shall not have an arbitrary order but rather shall be ordered and/or grouped by at least one of the following as determined by the users’ task:

- Frequency of use.
- Importance to task.
- Sequence of entry or review.
- Functional relations.
- Functional similarity.
- Name (i.e., alphabetical).
- Date or time stamp.

For dynamically created forms or tables, this means don’t forget the ORDER BY clause in your stored procedures. Note that some of the above criteria for ordering imply your database may need fields specifically for determining display order (e.g., task importance).
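As a sketch of that last point, here is the idea in Python with an in-memory SQLite database; the `fields` table and its `task_importance` column are hypothetical examples of an explicit display-order field:

```python
import sqlite3

# Hypothetical table with an explicit display-order field, since criteria
# like "importance to task" can't be derived from the data itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fields (name TEXT, task_importance INTEGER)")
conn.executemany("INSERT INTO fields VALUES (?, ?)",
                 [("Notes", 3), ("Due Date", 1), ("Status", 2)])

# Never leave display order to chance: always ORDER BY something meaningful.
rows = conn.execute(
    "SELECT name FROM fields ORDER BY task_importance").fetchall()
print([r[0] for r in rows])
```

Without the ORDER BY clause, the database is free to return rows in whatever order is convenient for it, which is exactly the arbitrary ordering the standard prohibits.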

**5.14.3.5.10 Justification of numeric entry.** Users shall be allowed to make numeric entries in tables without concern for justification; the computer shall right-justify integers, or justify with respect to a decimal point if present.

This means that columns of numbers that use the same units shall have the decimal place of the numbers vertically aligned, so users can compare numbers easily. Columns should look like this:

     5.0750
     1.8000
    46.5000
      .5209
     8.0000

Rather than this:

    5.075
    1.8
    46.5
    .5209
    8

For user-entered numbers, a form shall pad the numbers as necessary to keep the decimal points aligned irrespective of what the users type.
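A minimal sketch of such padding, using Python’s fixed-precision formatting (the four-decimal precision is just an assumption for illustration; Python keeps a leading zero before the decimal point, which the standard permits where needed for clarity):

```python
values = [5.075, 1.8, 46.5, 0.5209, 8]

# Pad every number to the same number of decimal places and right-justify,
# so the decimal points line up in a column.
decimals = 4
width = max(len(f"{v:.{decimals}f}") for v in values)
column = [f"{v:>{width}.{decimals}f}" for v in values]
for line in column:
    print(line)
```

Rendered in a monospaced or tabular-figure font, the column then reads straight down with every decimal point aligned.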

**5.14.4.2.6 Simple menus.** If number of selections can fit on one page in no more than two columns, a simple menu shall be used. If the selection options exceed two columns, hierarchical menus may be used.

This means use simple (non-hierarchical) menus up to two columns wide rather than split the menu into two pages, assuming the menu items fit on a page with the other content intended for the page. Avoiding a click to navigate down a level in a menu hierarchy is worth the added scan difficulty that two columns impose. However, once you get to needing three columns or more, you might be better off splitting the menu up onto separate pages. That’s up to you.

**5.14.3.5.13 Column scanning cues.** A column separation not less than three spaces shall be maintained.

Translating that to the web, columns in tables shall be horizontally separated by a distance equal to 3 spaces of Arial font of the same point size as used in the columns when the browser is set to display medium text size. This makes the columns appear as cohesive wholes, making it easier to scan down them without the eye “jumping” to an adjacent column. In some cases, such spacing obviates the need for cluttering vertical rules, further improving scannability.

**5.14.3.3.1 [Display Coding] Use.** Coding shall be employed to differentiate between items of information and to call the user’s attention to changes in the state of the system. Coding should be used for critical information, unusual values, changed items, items to be changed, high priority messages, special areas of the display, errors in entry, criticality of command entry, and targets. Consistent, meaningful codes shall be used. Coding shall not reduce legibility or increase transmission time.

In other words, colors and other graphic elements should be selected to consistently *mean* something, not just make the web page look good. Specifically:

- All navigation menus shall use the same color and style for all pages.
- The navigation menu shall appear visually distinct from any other content that may appear in its place on a page.
- All text links shall have the same color unique to links, except that visited links may have a different color than unvisited links.

**5.14.5.6 Highlighted option selection.** Any displayed message or datum selected as an option or input to the system shall be highlighted to indicate acknowledgment by the system.

This means selected items need to be somehow graphically distinguished from non-selected items. For example, on a navigation menu sidebar, the selected menu item shall change appearance on selection, and that appearance shall be maintained as long as that menu selection is maintained. If the menu item corresponding to your current page looks like all the other menu items, then you’re doing it wrong.

**5.14.4.8.3 Supplementary verbal labels.** Where icons are used to represent control actions in menus, verbal labels shall be displayed, or made available for display, with each icon to help assure that its intended meaning will be understood.

So each icon needs at least a tooltip, but keep in mind these are minimum standards. For most actions, you get better usability if you include a static text label, or even use a text label *instead* of the icon.

**5.14.9 System response time.** Maximum system response times for real-time systems shall not exceed the values of Table XXII.

Below I’ve reproduced portions of Table XXII with a separate column to give the equivalent application for web sites. For example, any user mouse click, including input to a custom AJAX control, shall result in some kind of a visual response within 0.2 seconds. A page shall begin to appear in a window within 2 seconds after the user selects its link.

| Input | sec. | Web Equivalent |
| --- | --- | --- |
| Key Print | 0.2 | Display of a keyed character after any per-character validation and correction. |
| XY Entry, Pointing | 0.2 | Response of some kind of visual verification to a mouse click. |
| Sketching | 0.2 | Update period for any drawing or any dragging. |
| Page Scan | 0.5 | Update period for carousels and other scrolling controls. |
| Page Turn | 1.0 | Response of tabs, lightboxes, and other loading of content within part of a web page. |
| Host Update, Simple Inquiry | 2.0 | Response from selecting a link until a page begins loading. |
| Error Feedback | 2.0 | Response for any server-side validation after data submission. |
| Complex Inquiry | 10.0 | Response for ad hoc queries or other operations involving submission of multiple fields. |
| File Update | 10.0 | Response for loading non-HTML content, such as pdfs. |

Once again, these are *minimum* standards assuming worst-case conditions.

Too often discussions of response time for web sites take one of two extremes. One extreme is “make it as short as possible.” This does little to help you specify the appropriate server and network performance for the real world. The other extreme is “right before the point where users abandon the site.” I would hardly define “usable” to mean “the level of performance where users just barely stop assuming your site is broken.” Table XXII provides an alternative to the extremes that has held up remarkably well over the decades.

**5.14.1.2 Computer response.** Every input by a user shall consistently produce some perceptible response output from the computer.

**5.14.5.2 Stand-by.** When system functioning requires the user to stand-by, a WORKING, BUSY, or WAIT message or appropriate icon should be displayed until user interaction is again possible. Where the delay is likely to exceed 15 seconds, the user should be informed. For delays exceeding 60 seconds, a count-down display should show delay time remaining.

**5.14.2.1.3 Processing delay.** Where system overload or other system conditions will result in a processing delay, the system shall acknowledge the data entry and provide an indication of the delay to the user. If possible, the system shall advise the user of the time remaining for the process or of the fraction of the process completed.

For most web sites, the above standards are not much of an issue. Most input is asynchronous, so there’s no need for a BUSY indication. The browser’s throbber provides immediate response for the most common input, link-clicking, so you don’t have to worry about it. Start loading your pdfs and other bulky files within 10 seconds for worst-case conditions, as discussed earlier, and you don’t have to worry about informing the user that they’ll need to wait over 15 seconds. However, if you start getting fancy, employing AJAX or Flash, which can subvert the throbber and other browser feedback, then you may find you have to create your own feedback, including progress bars or equivalent for anything over 60 seconds.

**5.14.2.1.8.4 Explicit actuation.** A separate, explicit action, distinct from cursor position, shall be required for the actual entry (e.g., enabling, actuation) of a designated position.

**5.14.3.6.9 Confirming cursor position.** For most graphics data entry, pointing should be a dual action, with the first action positioning the cursor at a desired position and the second action confirming that position to the computer. An exception may be a design allowing “free-hand” drawing of continuous lines where the computer must store and display a series of cursor positions as they are entered by the user.

Here the main culprits on web sites are a breed of JavaScript pulldown menus that display their contents on mouse-over rather than a click. Menus shall only appear when the user clicks the mouse. However, the above standards should not be interpreted to preclude the use of mouseover effects to provide pointer position feedback or supplemental information to the user, such as through tooltips.

**5.14.3.1.6 Recurring data fields**. Recurring data fields within a system shall have consistent names and should occupy consistent relative positions across displays.

This is just a matter of avoiding oversights and double-checking that you’re being consistent. You don’t want people being unsure whether the “Check Date” is necessarily the same as the “Payment Date.”

**5.14.3.1.1 Consistency**. Display formats should be consistent within a system. When appropriate for users, the same format should be used for input and output. Data entry formats should match the source document formats. Essential data, text, and formats should be under computer, not user, control.

This standard was probably made with keypunching in mind, but in this era of copy and paste it’s ever more important. For example, your web site should never reject the input of a date, phone number, credit card number, or other datum when it has the same format your own web site uses.
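A sketch of that principle for dates, accepting any of the formats the site itself might display. The list of accepted formats here is hypothetical; in practice it should include, at minimum, every format your own pages display:

```python
from datetime import datetime

# Hypothetical list: every date format the site itself displays,
# plus common keyed-in variants.
ACCEPTED_FORMATS = ["%m/%d/%Y", "%m-%d-%Y", "%Y-%m-%d", "%B %d, %Y"]

def parse_date(text):
    """Try each accepted format in turn rather than rejecting variants."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")

print(parse_date("March 5, 2010"))  # the format the site displays
print(parse_date("03/05/2010"))     # a common keyed-in variant
```

The point is symmetry: anything users could have copied from your own output should be acceptable as input.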

**5.14.3.7.10 Predefined formats**. When text formats must follow predefined standards, the required format shall be provided automatically. Where text formats are a user option, a convenient means should be provided to allow the user to specify and store for future use the formats that have been generated for particular applications.

**5.14.3.5.7 Numeric punctuation**. Long numeric fields should be punctuated with spaces, commas, or slashes. Conventional punctuation schemes should be used if in common usage. Where none exist, a space should be used after every third or fourth digit. Leading zeros shall not be used in numerical data except where needed for clarity.

And while we’re on the topic of credit card numbers, we must stop this madness of rejecting user input of these numbers with spaces. For any formatted string input, if the user omits delimiters for formatted text, then the delimiters should be added automatically to user input without displaying an error message. Prohibited characters (like letters in a credit card number) should be automatically suppressed or removed from user input without displaying an error message.
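A tolerant credit card field amounts to a few lines of normalization, sketched here. The function name is my own; nothing about it comes from any standard library.

```javascript
// Normalize a credit card number the user typed: silently drop anything
// that isn't a digit (spaces, dashes, stray letters), then re-insert a
// space after every fourth digit for display. No error message needed.
function normalizeCardNumber(raw) {
  const digits = raw.replace(/\D/g, "");          // suppress prohibited characters
  return digits.replace(/(\d{4})(?=\d)/g, "$1 "); // group digits in fours
}
```

Whether the user types `4111111111111111`, `4111-1111-1111-1111`, or pastes the number with spaces, the field ends up in your site’s own display format, which is exactly what 5.14.3.1.1 asks for.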

**5.14.4.3.6 Cursor.** A displayed cursor shall be positioned by the system at the first data entry field when the form is displayed.

This means that when displaying a cleared form or a form with new content for a session, focus shall be placed on the editable control at the topmost leftmost position. Save the user from having to use the mouse to click into the field. This applies to pages for searching and querying.
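“Topmost leftmost editable control” is a computable rule. Here is a small sketch; the field shape (`editable`, `disabled`, `top`, `left`) is my own illustration, not a real DOM API, though in a browser you would feed it the fields’ measured positions.

```javascript
// Return the field that should receive initial focus: the first editable,
// enabled field in reading order (top-to-bottom, then left-to-right).
function initialFocus(fields) {
  const candidates = fields.filter(f => f.editable && !f.disabled);
  candidates.sort((a, b) => a.top - b.top || a.left - b.left);
  return candidates[0] ?? null; // null when the form has no editable field
}
```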

**5.14.7.1 Workload reduction**. Default values shall be used to reduce user workload. Currently defined default values should be displayed automatically in their appropriate data fields with the initiation of a data entry transaction and the user shall indicate acceptance of the default.

**5.14.7.4 Defaults for sequential entries**. Where a series of default values have been defined for a data entry sequence, the experienced user shall be allowed to default all entries or to default until the next required entry.

That means that when a list box, dropdown list (combo box), checkbox, or radio button (option button) is used to acquire user input of an unknown value, it shall default firstly to a value that does not result in damage to the system or loss of data, and secondly to the most frequent value.
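That two-step priority (safe first, then most frequent) can be sketched as a small selection function. The option shape here, with `safe` and `frequency` attributes, is my own illustration of the rule.

```javascript
// Choose a default option per the standard's priority: restrict to values
// that cannot damage the system or lose data, then take the most frequent.
function chooseDefault(options) {
  const safe = options.filter(o => o.safe);
  const pool = safe.length ? safe : options; // fall back if nothing is marked safe
  return pool.reduce((a, b) => (b.frequency > a.frequency ? b : a)).value;
}
```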

I’m not sure where the web got the idea to never use defaults, except to default to opting in to annoying email spam. Radio buttons are often left with none selected. Dropdown lists default to nonsensical “Select an option” values. Thick-client desktop apps worked fine with defaults for these controls for years before the web showed up. I suspect that many web designers think you get more accurate data if you force the user to explicitly choose an option. While there may be situations where this is the case, usually *defaults* make it more likely you get accurate data, especially when there’s one choice that is likely to be correct. By freeing the user from selecting the value, you avoid the slips the user might make, and you make input faster as well.

**5.14.8.1 Error correction**. Where users are required to make entries into a system, an easy means shall be provided for correcting erroneous entries. The system shall permit correction of individual errors without requiring re-entry of correctly entered commands or data elements.

If the user makes an error on a form and submits, you can show an error message on the subsequent page. That’s not the best way of handling errors, but it meets a minimum standard of usability. However, when the user goes back to the form to correct the error, all entries for all fields shall be just as the user left them. Do not present a cleared form. Do not clear *any* fields, not even the password field. Clearing adds work and sets the user up for another error.
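On the server side, preserving entries means re-rendering the form from the submitted values rather than from a blank model. A minimal sketch, with a hypothetical field shape of my own:

```javascript
// Re-populate a form after a validation error: every field keeps the
// user's submitted value (including the password field); fields the user
// never touched keep whatever value they already had.
function repopulate(fields, submitted) {
  return fields.map(f => ({ ...f, value: submitted[f.name] ?? f.value }));
}
```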

**5.14.4.1.13 Feedback for correct input**. Control feedback responses to correct user input shall consist of changes in state or value of those display elements which are being controlled and shall be presented in an expected and logically natural form. An acknowledgment message shall be provided only where the more conventional mechanism is not appropriate or where feedback response time must exceed one second.

**5.14.5.4 Input confirmation**. Confirmation shall not cause displayed data removal.

Web forms use way too many confirmation pages. Partly this is because of the unreliability of web connections and the long delay between the user selecting the Submit button and the server sending some kind of acknowledgment. As a minimum standard, a confirmation page is acceptable for such cases, but the confirmation page should display what data is being confirmed. A purchase confirmation, for example, should list the products, prices, and shipping.

Once you get into AJAX, with response times of less than a second, there shall be no confirmation message. It is sufficient, and less distracting, to display the changes to the data as the confirmation.

**Potential Solution**: Use standards and guidelines such as MIL-STD 1472

For more of my interpretations of MIL-STD 1472 for web apps, specifically for tables, see Table usability/Readable font for a data table at UI StackExchange.
