Designed to Fail

Instability and unreliability, and what UX designers can do about it.

HAL 9000 suffering a blue (or red) screen of death

Crap, I clicked the scrollbar. What a stupid mistake. I only clicked once, but now I’ve lost control of my computer. The document is stuck scrolling on its own, as if it thinks I’m still holding down the scroll button. Clicking anywhere has no effect. Keyboard input is likewise ignored. I can’t even bring up the task manager to kill the process. Pull the plug? That would work, except I’m using Juniper’s secure remote access (SRA) to edit a document on a remote computer -the plug that needs pulling is over 40 miles away. Disconnect and restart the SRA session? I log back in to find the document still scrolling on its own in my absence. I’ve little choice but to watch my document scroll by until it gets to the end, which from previous experience I know is the point where SRA will snap out of its OCD and return control to me. Hey, that’s only another… 93 pages. Crap, crap, crap.

SRA is a remarkable technology. It’s astonishing. A wonder to behold. I look at what it does and it boggles my mind. Yes, I’m using double entendre, but that’s how I feel. I really am amazed that I can fire up my laptop anywhere there’s an internet connection and get on my desktop humming quietly by itself in another quadrant of the galaxy, and it’s just like I’m sitting in front of it, except when it isn’t. Except when it slips off randomly into software psychosis, sometimes never to recover. Some antecedents to these failures are known, such as being so bold as to scroll, but many more are not. At any time, one false move, and I’m hosed.

I seriously consider not using SRA, but that’s a dangerous path to take. By the same logic, I shouldn’t be using computers at all. All of my experience with computers is characterized by such inexplicable, unpredictable, intractable, and utterly infuriating failures. Using a computer means dealing with the fact that sometimes it just plain doesn’t work. Here, for example, are the current outstanding problems on my home desktop:

  • Crashes. About twice a month, my computer will randomly lock up, and I have to reset it. It spent two weeks back at its maker, and they determined that the crashes were due to Dynamic Overclocking being somehow turned on in the BIOS. If you Google for Dynamic Overclocking, you’ll find it’s a very clever technology for making your computer crash. Turning off Dynamic Overclocking has reduced the frequency of crashes, but has not eliminated them.
  • Refuses to shut down. Sometimes, it gets stuck because it can’t stop a process called CDirectoryChangeWatcherHelper. Once I came back from a long weekend to find my computer still trying to shut down from days before, still waiting patiently for me to say, yes, please kill that stupid CDirectoryChangeWatcherHelper process. If you Google for CDirectoryChangeWatcherHelper, you’ll find it’s used by Nero Scout, which is a powerful media indexing utility that everyone is trying to get rid of.
  • Ignores my eSATA drive. This is particularly frustrating because I chose an MSI K9A2 Platinum motherboard specifically for its touted eSATA support. If you Google for eSATA, you’ll find that it’s the latest, fastest way to transfer data to external drives that doesn’t work for an MSI K9A2 Platinum motherboard. A lengthy exchange with tech support for the motherboard yielded various BIOS flashes that failed to help, followed by advice to use a pathetically slow ATA drive in an eSATA enclosure, which works just fine, they assure me, and entirely defeats the purpose of having eSATA at all.
  • Ordinary serial devices too. I mean, eSATA, that’s bleeding-edge technology, so what do you expect? If you want reliability, you should stick with established technologies, like a good old-fashioned 9-pin serial port, also supported by the MSI K9A2 Platinum motherboard like eSATA, in the sense that it doesn’t work either.
  • Evaporating video. Oh, here’s a new one. Recently, my video display has taken to spontaneously shifting from 32-bit color depth to 8-bit. I nudge the computer out of screen saver to find everything dithered, rendering things hard to read. The Windows Display properties claims I’m using 32-bit, but the appearance is clearly limited to 256 colors (not that I actually counted them). I have to reboot.
  • Missing user. One of my Limited Account users has spontaneously disappeared from My Computer in Windows Explorer when I’m in an administrator’s account. Now when I need to access files from within that user’s folders, instead of getting there in one click, I have to go through Local Disk (click) Documents and Settings (click) user (click) My Documents (click). I don’t know how to get the user back on My Computer, except maybe by reinstalling Windows, which could (a) break something else, and (b) not actually work.
  • Blocks access to files over the network. The desktop has a couple folders set to be shared over the home wireless network, but the laptops refuse to acknowledge that they exist. Network places set up on the laptops self-destruct between sessions. Occasionally on a laptop, Windows Explorer will show the files in the shared folders, but won’t let me open any of them, or it’ll let me open them, but not save them. My home network is very useful as long as I don’t need to network through my home.
  • Anti-virus goes catatonic. Once in a while I open the window for my Trend Micro anti-virus software to find that no one is home. Clicking any button freezes the window and it has to be killed via the task manager. Of course, I have to wonder if such failures are limited to user interactions -maybe it’s equally inert when confronted with malware.
  • Solid modeler doesn’t. The primary reason I’m still using a Windows machine is for my IronCAD solid modeling program. It, however, is simply unable to handle what I regard as pretty basic shapes, complaining that the geometry is too complex for its pitiful little ACIS kernel to handle. Oh, yeah, and sometimes it crashes.
  • Permanently disabled functions. Maybe the problem is that I’m paying too much money for my apps. Maybe I should use more open source software, which provides greater reliability, except when it doesn’t. My open source Nvu HTML editor has decided that never is a good time to Paste Without Formatting. The menu item for this function just one day decided to become disabled and has stayed that way no matter what is on the clipboard. Just checked. Yup, still disabled. Foxmail, my freeware email client, has a similar problem, perhaps inspired by the HTML editor. Copy is permanently disabled on the Edit menu for any HTML email, except it can’t even get that right: the menu item is disabled, while the function is not, working fine if I use Ctrl-C or a context menu.
  • Single sort order for pictures. In my other installations of Windows XP, Picture and Fax Viewer shows pictures in the same sort order as set in the parent Windows Explorer window. On my desktop, however, it always uses alphabetical order, which confuses me (since I use Explorer together with Picture and Fax Viewer), and inconveniences me since often I need to go through pictures by date or file type.
  • Sound skipping. When I play music through Windows Media Player, sometimes it skips once at the beginning of the track, just to annoy me.

A User’s Response to Software Failures

Working with computers means continuously facing the same decision regarding failures:

  • Fix it.
  • Work around it.
  • Live with it.

Fixing problems first means finding the fix, which is only remotely possible through the miracle of Google. When I bought the desktop over a year ago, I came to the surprising realization that one cannot have a computer without an internet connection. What with all the problems out of the box with the BIOS, drivers, and software installations, one would never be able to get it to work without access to knowledge bases, tech support email and chat, downloads, and user self-help forums. The same services that the computer provides for us are necessary for the computer itself to function. Not that access to the internet is any guarantee of function, as my experience with eSATA attests. Sometimes it is a waste of time, and when you’re talking about things like BIOS flashing and registry editing, it can even make things worse. Fixing is not always the optimal response.

Working around the problem often provides a better chance of success, but it means suboptimal and odd user behavior. I put a shortcut to the Limited user’s folder on the Administrator user’s desktop, and it works fine, but it also adds clutter and is one more weird idiosyncrasy to remember. To move a file from the laptop to the desktop beside it, I email the file, akin to using FedEx to leave a note for your housemate. Someday someone will see me do my crazy ritual with the HTML editor of pasting text into Notepad only to cut and paste it from there into the editor as my work-around for the disabled Paste Without Formatting problem. He’ll say, You can paste without formatting. I’ll reply, No you can’t. Yes, you can, he’ll say, and pull down the Edit menu to reveal a miraculously restored and healthily enabled Paste Without Formatting menu item. And I’m just a total luser.

Then there’s living with it. Sometimes the problem, like the skipping in music tracks, isn’t worth the effort to fix or work around, partly because of the high uncertainty about the amount of effort necessary. When my wife asks me to fix something on her computer, like upgrading the antivirus, for example, she asks how long it’ll take, and my answer is always the same: I don’t know. If everything goes right, it’ll take 10 minutes, but if something goes wrong, it could take all day. Based on what the web forums say, for example, fixing the CDirectoryChangeWatcherHelper problem sounds easy enough, but that doesn’t necessarily mean it will be. It’s not something I’m going to want to try on a day when I have more important things to get done. As it is, it seems I spend a substantial portion of my computer time fixing and maintaining the computer. Owning a computer sometimes feels like owning a car just so it’s easy to drive to the mechanic to get it repaired.

The Human Factor in Failure

This is a far cry from the future world we expected. Going back 40 years, computers were regarded by the general public as the way to overcome human weaknesses, able to quickly and objectively process vast rivers of data without a single error, in stark contrast to a human being. The sci-fi computer HAL from Stanley Kubrick’s 2001: A Space Odyssey represented what people anticipated in 21st-century computers: perfection. An intelligence superior to that of the human brain, incapable of error and utterly reliable. Okay, so HAL had one tiny little glitch that in the context of interplanetary travel turned it into a ruthless serial killer, but even there you can debate whether it was a bug or a feature. There’s a certain logic to the idea that contact with extraterrestrials shouldn’t be trusted to something flaky like a human when there’s a better option like HAL on board the same spaceship. However, if HAL were like a real 21st-century computer, it would’ve killed all the Discovery’s crewmembers not out of malice but accidentally, while trying to fluff up their pillows or something. The creators of 2001: A Space Odyssey conceived of HAL as a product of humans and therefore the technological leveraging of human qualities, both the good and the bad: our intelligence and rationality, but also our fear and aggression. What we actually got by the year 2001 are computers that have leveraged the very human flakiness that HAL distrusted. With our advanced technology, we now expend far less effort making far bigger mistakes, to take a variation on Horowitz’s rule.

We can blame computer software for its faults, but the root cause is the human factor in its creation. Modern computer programs are complex and coupled systems, virtual mechanisms with countless parts having countless interdependencies. In the early 1980s, you could read one book by Peter Norton and know pretty much all there was to know about programming a PC. Today, it is not humanly possible to know everything there is to know. Even a single application is often too complex for any one person to understand completely. We’ve attempted to solve this problem much like we did for the integrated circuits that run the software, by subdividing the knowledge space, having different developers responsible for different things, and thus building systems as components and layers. In theory, the only people who need to understand the hardware are those who make drivers. A desktop application developer doesn’t really need to know the operating system, only the API. A web app developer doesn’t need to know the platform, only the web standards. Individual applications are divided into encapsulated classes and assigned to different developers. Sid doesn’t need to understand what’s in Pitr’s classes, only their public functions. But it appears this hasn’t worked. The components and layers interface with each other, producing unanticipated dependencies that result in failures. The dependencies are so complex as to be unfathomable, and bugs are widely regarded as inevitable. The idea of error-free programming is literally a joke.

Google search results
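
To make the unanticipated-dependency problem concrete, here is a toy sketch in Python. Everything in it is invented for illustration: imagine ConfigStore is Pitr’s component and render_title() is Sid’s. Each is arguably correct against its own spec, yet the combination fails.

    # Pitr's component: exposes get(); caching is a hidden implementation detail.
    class ConfigStore:
        def __init__(self, source):
            self._source = source
            self._cache = {}

        def get(self, key):
            # Cache forever -- fine for Pitr's own use case, invisible to callers.
            if key not in self._cache:
                self._cache[key] = self._source[key]
            return self._cache[key]

    # Sid's component: assumes get() always reflects current state.
    def render_title(store):
        return "Window title: " + store.get("title")

    source = {"title": "Draft 1"}
    store = ConfigStore(source)
    print(render_title(store))   # Window title: Draft 1
    source["title"] = "Draft 2"  # state changes behind the interface
    print(render_title(store))   # still "Draft 1" -- Sid's assumption broke

Neither unit test would catch this; the defect lives in the seam between two components that each pass inspection in isolation.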

Couple the human cognitive limits in understanding modern software with the human motivations arising from the organizational context, and such failures are inevitably frequent. There are techniques for creating highly reliable software, but these are too labor-intensive to be economically practical for consumer software. In a market that rewards features, capabilities, and early availability, apps are sold with known defects. The concept of a product being in “perpetual beta” is hailed as a development innovation. Legal protection is assured by well-written EULAs that basically say, “you can’t sue us if, say, this software turns into a ruthless serial killer.” With EULAs openly absolving the software of any obligation to actually work, it’s perfectly legal to sell software DVDs that are perfectly blank. I suppose the only reason software makers don’t pursue this potentially lucrative business model is because, while consumers take a lot of crap from software makers, even they might catch on to something like that. Instead, consumers rightly demand (and receive) free patches and tech support. Users don’t even expect computers to work anymore. One high-end user buys only the most common hardware configurations, not because they’re more likely to be reliable but because the solutions to their problems are more likely to be known in the user community. From a capitalistic perspective, maybe it all balances out. Outsourcing tech support and posting patches is more profitable than getting the software right in the first place. Users would rather have something that works 95% of the time now than wait a few years and pay more money for something that works 99.99% of the time. And the result is a balance: users get products like SRA that are just not quite frustrating enough to outweigh the benefits they provide.

A Coming Backlash?

But we may soon be approaching a tipping point in that balance. The software of each computer I’ve owned has been more problematic than the one before. Personal computers may be the first technology that has gotten less reliable with maturity. Aircraft and automobiles, for example, got more complex but also more reliable as they evolved; not so personal computers. The functionality provided by a PC is basically unchanged since 1998, but reliability over that time has, if anything, gotten worse. The technological future is looking less like 2001: A Space Odyssey and more like Brazil, with its bloated, baffling, and literally buggy electro-pneumatic systems that seem only remotely manageable by an elite trans-legal black-hat. Back in the days of DOS, a crashing operating system or production application was almost unheard of. Youngster users have it so easy these days. In order to lock up my DOS machine, I had to write the code myself. It was understood that Version 1.0 of anything might not be the most dependable, but by Version 2.0, users could expect complete stability. Today, upgrading doesn’t bring any sense of greater stability. Rather than representing an opportunity for new and improved reliability, functionality, or at least imagery, upgrading means going through an arduous migration and painful re-learning to enjoy the benefits of new and improved bugs. Upgrading has come to be viewed with trepidation rather than anticipation, something that has to be forced on an unwilling consumer.

Users nowadays regard software failures as inescapable, and this low standard has made it possible for software companies to keep selling to us. However, it’s a mistake to ignore such consumer fatalism. People buy computers, but they also hate them: partly it’s poor usability, partly it’s low reliability, and often users can’t tell the difference. All it takes now is a little nudge, and users will revolt. Microsoft released Windows Vista knowing the drivers to run it were not readily available, convinced that the “wow” of amazing visuals was more important than the thing actually working. After all, software is always buggy, so “wow” is the only way to get brand differentiation. But consumers were pushed over the edge. They weren’t buying it anymore, literally.

You want “wow”? Forget about bright new graphics with reflections and transparency. Stop treating computer use like it’s passively watching a fireworks display. Working fine is now the exception -there’s your brand differentiator. I remember when I got my first USB external hard drive. Dubiously, I took it out of its box and plugged it in. And it worked exactly like it should. No drivers to install, no hardware scans to initiate or New Hardware Wizard to work through, no futzing with Disk Management in compmgmt.msc, no patches to download, no mucking in the Registry. I plugged and it played. You know what I said? “Wow.”

Usability/UX Solutions to Defective Software

So software is defective. What can a lowly UI designer like you do about it? Maybe nothing. Or maybe everything.

A Less Lousy Experience

There’s not much you can do with the UI design to prevent defects, but you can do something with the UX to manage defects. Knowledge bases, self-help forums, tech support, and patches are among the means of managing defects, but from a UX perspective these should be designed to best serve the user. Too often they’re tacked-on afterthoughts or outsourced to the lowest bidder, regarded as a necessary price of remaining competitive. The UX of these services would be improved by basic attention to usability and customer service, such as providing automatic patching, knowledge bases with effective search engines and useful articles, and tech support queues less than one minute long. The latter would do much for your tech support staff’s morale too: it should be no surprise that some already frustrated users turn abusive after waiting on hold for half an hour. The incentive structure for tech support staff should encourage solving problems, rather than getting rid of the user as quickly as possible. Forums need to be moderated by an engineer who cross-checks answers, weeding out useless or malicious ones, raising the profile of effective ones, and perhaps adding validated solutions and work-arounds to the knowledge base. Tech support, knowledge bases, forums, and patches should operate as a coordinated system rather than separate entities. The knowledge base and forum should be searched simultaneously, with matches to each differentially marked. Questions asked in a forum should be readily transferable to tech support and vice versa. Issues and answers uncovered by tech support and the forum should be fed to designers and developers to facilitate development of patches and documentation of work-arounds for the knowledge base.
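
At its simplest, the “search both at once, mark each differently” idea could look something like the Python sketch below. The search_kb() and search_forum() functions and their result format are stand-ins I’ve invented; a real implementation would call whatever search back-ends the vendor actually has.

    def search_kb(query):
        # Stand-in for a query against the vendor's knowledge base.
        return [{"title": "eSATA drive not detected", "url": "/kb/1042"}]

    def search_forum(query):
        # Stand-in for a query against the user forum's search engine.
        return [{"title": "Any eSATA workaround?", "url": "/forum/981"}]

    def unified_search(query):
        # One query, two sources; tag each hit so the user can tell a
        # vetted article from a forum thread at a glance.
        results = [dict(hit, source="Knowledge base") for hit in search_kb(query)]
        results += [dict(hit, source="Forum") for hit in search_forum(query)]
        return results

    for hit in unified_search("esata not detected"):
        print("[%s] %s (%s)" % (hit["source"], hit["title"], hit["url"]))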

With the application software itself, features can be included to minimize the impact of defects. Document and context information can be automatically saved, as MS Office and Firefox do, so that work is not lost in a crash. In the event of a crash, the app can automatically restart itself and restore the context so work can continue. Milestones can also be automatically kept, so the user can revert to an earlier version of their work if something goes amiss. Ideally, such an undo capability would be selective, allowing the user to erase one change that’s causing a problem without reverting subsequent changes.
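
As a minimal sketch of the auto-save-with-milestones idea, here is some hypothetical Python, assuming a document that can be serialized to text; the file naming, directory, and retention limit are all invented for illustration.

    import os, time

    AUTOSAVE_DIR = "autosave"
    MAX_MILESTONES = 20   # how many snapshots to retain per document

    def autosave(doc_text, doc_name):
        # Write a timestamped snapshot, then prune the oldest ones.
        os.makedirs(AUTOSAVE_DIR, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")
        path = os.path.join(AUTOSAVE_DIR, "%s.%s.snapshot" % (doc_name, stamp))
        with open(path, "w", encoding="utf-8") as f:
            f.write(doc_text)
        snapshots = sorted(p for p in os.listdir(AUTOSAVE_DIR)
                           if p.startswith(doc_name))
        for old in snapshots[:-MAX_MILESTONES]:
            os.remove(os.path.join(AUTOSAVE_DIR, old))
        return path

    def latest_snapshot(doc_name):
        # On restart after a crash, restore from the newest snapshot.
        if not os.path.isdir(AUTOSAVE_DIR):
            return None
        snapshots = sorted(p for p in os.listdir(AUTOSAVE_DIR)
                           if p.startswith(doc_name))
        if not snapshots:
            return None
        with open(os.path.join(AUTOSAVE_DIR, snapshots[-1]), encoding="utf-8") as f:
            return f.read()

Keeping timestamped snapshots rather than overwriting one backup file is what makes reverting to a milestone possible; selective undo would additionally require storing changes as discrete operations rather than whole-document snapshots.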

You can provide a means to override automation by providing a UI to independently execute each step in the automation. In addition to increasing the flexibility of your app, it provides a work-around in the event the automation fails. For example, Windows could provide a means to manually add a user’s folder to My Computer to solve my disappearing-user problem. Actually, maybe Windows could allow the user to add any shortcut to My Computer -that could be a pretty handy feature. You can also provide a means for users to automate workarounds, perhaps through a scripting feature. If a necessary process won’t exit on shutdown, then let the user attach a script to the shutdown process that kills the process for the user. If Paste Without Formatting refuses to work, maybe a user can make a script that does the paste-to-notepad-cut-and-paste-to-document trick. Inelegant, sure, but better than what I have now. Not every user needs to know how to make the script. Once one user has created a working script, the tech support system can help disseminate it.
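
For instance, the kill-it-at-shutdown workaround could be as small as the Python script below, run from a shutdown or logoff task. It assumes Windows, and it assumes the offending process shows up under the image name CDirectoryChangeWatcherHelper.exe, which may not match the real culprit’s name on any given machine.

    import subprocess

    # Processes known (on this machine) to hang the shutdown sequence.
    HUNG_PROCESSES = ["CDirectoryChangeWatcherHelper.exe"]

    def kill_hung_processes():
        for name in HUNG_PROCESSES:
            # taskkill is a standard Windows tool: /F forces termination,
            # /IM matches the process by image name.
            result = subprocess.run(["taskkill", "/F", "/IM", name],
                                    capture_output=True, text=True)
            print(result.stdout or result.stderr)

    if __name__ == "__main__":
        kill_hung_processes()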

One of the problems with dealing with computer failures is determining exactly what has failed. Take for example my problem accessing files on the home network. Is it a problem in the client laptop or the host desktop? Or maybe it’s the router? Should I suspect a component of the OS of one of the computers, which controls file-sharing, or the anti-virus software, which controls anti-file-sharing? Should I suspect a hardware problem, maybe in the router or network adapters? Or should I suspect MS Office, just on principle? A few years ago, I had a network problem that was ultimately traced to the installation of my Norton Ghost backup software. I mean, huh? When problems span systems from different vendors, who do you call for tech support? Perhaps software can include better auditing features that track processes, making it easier to see where cross-system problems occur.
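
A crude sketch of what such auditing might look like, in Python: wrap each layer of a cross-system operation so that the failing hop names itself in a log. The step names, host name, and choice of probes are all illustrative.

    import logging, socket

    logging.basicConfig(filename="share_audit.log", level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    def audited_step(name, fn, *args):
        # Run one step of the operation, recording where it succeeds or dies.
        logging.info("start: %s", name)
        try:
            result = fn(*args)
            logging.info("ok: %s", name)
            return result
        except Exception:
            logging.exception("failed: %s", name)   # the failure point is now on record
            raise

    def check_share(host, port=445):
        # Probe each layer in turn: name resolution, then the file-sharing port.
        addr = audited_step("resolve host", socket.gethostbyname, host)
        sock = audited_step("connect to file-sharing port",
                            socket.create_connection, (addr, port), 5)
        sock.close()

    if __name__ == "__main__":
        try:
            check_share("desktop")   # hypothetical host name
        except Exception:
            print("See share_audit.log for the step that failed.")

With a record like this, at least the finger-pointing between client, host, router, and anti-virus starts from evidence instead of guesswork.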

More comprehensive self-monitoring or auditing may even help in determining whether a failure has occurred. Recently I regenerated the table of contents for a large MS Word document and found it included all sorts of random text beyond the headings one would expect. How did this happen? Prior to regeneration, I had pasted text from another document. Did that pasting also import settings that affect table of contents generation? Is it a bug or a feature? Even if the self-monitoring records are not interpretable by the average user, they may be helpful to tech support or patch developers. Microsoft has long had a feature where the user can ship information about an application crash to developers for analysis. It can even provide links to knowledge base articles or patches if the cause is known. That’s a feature many apps could use.

Office Outlook dialog to forward crash info.
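
Any app can implement a rudimentary version of that crash-shipping feature. Here is a minimal Python sketch: a global exception hook that writes the traceback plus some context to a report file the user could then choose to send. The report format and file name are assumptions.

    import sys, time, platform, traceback

    def write_crash_report(exc_type, exc_value, exc_tb):
        # Capture enough context to reproduce: when, where, and what died.
        report = "\n".join([
            "time: " + time.strftime("%Y-%m-%d %H:%M:%S"),
            "platform: " + platform.platform(),
            "python: " + platform.python_version(),
            "traceback:",
            "".join(traceback.format_exception(exc_type, exc_value, exc_tb)),
        ])
        with open("crash_report.txt", "w", encoding="utf-8") as f:
            f.write(report)
        # A real app would now ask the user's permission to send the report,
        # and could match the traceback against known issues to offer a link
        # to a knowledge base article or patch.

    sys.excepthook = write_crash_report   # catches any otherwise-unhandled crash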

Humane Development

All these approaches can help improve the user experience of failure, but they have two disadvantages:

  • They don’t actually stop failures from occurring. They only make recovery a little less miserable.
  • They may cause failures. Code needs to be written for features like auto-save, milestoning, auditing, patch downloading and installing, forums, knowledge bases, and even tech support (e.g., when using chat rather than a phone). Each of these may themselves fail, compounding the problem. Automated patch downloads, downloadable scripts, and auditing systems also present potential security and privacy threats, made more likely by the inevitable defects. If software complexity is the problem, adding more complex software may create more problems than it solves.

If we’re going to eliminate software defects, then we have to look at their root cause. As designers, we’re not going to increase reliability by re-designing the users’ experience. Instead, we have to re-design the process for developing software. Usability engineering specifically, and human factors engineering in general, is about making systems compatible with the abilities and limits of human beings. If the root cause of the problem is code complexity that defies human comprehension, then we can say that we’ve created a system that is incompatible with the humans who work in it. Software development is simply not humane, to use Jef Raskin’s definition.

At this stage I don’t know what a redesigned, humane software development process will look like. Maybe it will involve a new programming language, designed to be error-resistant or error-tolerant. Certainly we can make errors like buffer overflows a thing of the past, along with silly syntax errors like confusing the “=” and “==” operators in C. Maybe it will involve new development environment software, one where the developer can better visualize programs at scales larger than the line-by-line details, to better assess interactions among components of a program. Maybe it’s a matter of having the right process or practices for designing and coding apps. Some effective processes and practices may already exist, and it’s primarily a problem of training developers to use them. Maybe it’s more about the process for communicating among developers, making it easier for each to appreciate the full implications of interfacing each other’s work to create a single app. Maybe it’s more about the work environment, such as providing coders with quiet private offices and outlawing death marches. Certainly software development is inherently intellectually challenging, but it’s entirely possible that we have created tools, procedures, and an organizational culture that make it considerably harder than it has to be.

Currently, software development tries to prevent defects from reaching production by subjecting software to testing, sometimes through code reviews, but primarily by functional evaluation by teams of test engineers. Each detected failure is analyzed to determine the underlying defect in the code, and the code is corrected. It appears that software development is following a quality control model from early twentieth-century physical manufacturing, where one sifts through the end product, measures each item against specs, and weeds out the duds. Worse, the test-and-fix approach to code quality doesn’t even test every item. There are too many combinations of use conditions to test them all, and test plans themselves can have errors. Where physical manufacturing can rely on statistical samples to effectively control quality, the sampling of software functionality in test plans doesn’t work that way. A defect in one line of code is not used to infer defects in other lines, and maybe it shouldn’t be (or maybe it should -I don’t know if anyone has ever checked for correlations). In any case, testing and related approaches have the same weakness as forums and patches for dealing with failures: they don’t prevent defects from occurring but rather try to catch them after they occur (but before production). I remember finding a book once that claimed to help you write bug-free code, but when I opened it, I saw that it was essentially a tutorial on using ASSERT statements that are checked at runtime -it didn’t help you write bug-free code; it made your bugs easier to track down. Code reviews and testing will probably always be good practice, but relying on them alone appears to be inadequate.
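
To make the ASSERT point concrete, consider the Python sketch below (the average() function is hypothetical). The assertion doesn’t prevent the defect of calling the function with an empty list; it just moves the failure to the boundary where the broken assumption gets named -exactly “making your bugs easier to track down.”

    def average(values):
        # Without this assert, an empty list fails one line later with a
        # cryptic ZeroDivisionError; with it, the failure names the actual
        # violated assumption. Either way, the defect still occurred.
        assert len(values) > 0, "average() requires a non-empty list"
        return sum(values) / len(values)

    print(average([3, 4, 5]))   # 4.0
    print(average([]))          # AssertionError: average() requires a non-empty list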

Perhaps the real software development process to adopt is one of continuous improvement: tracking failures, analyzing them for trends and root causes, then changing the development process to prevent them in the future without simultaneously creating new problems. Maybe the testing process shouldn’t stop with determining the code defect behind a failure. Maybe it should dig deeper to understand what led to the defect in the first place. “Coder had a brain fart” is not an acceptable explanation. What caused the fart? Why wasn’t there a fart-catching mechanism in place? Analyses can be performed on aggregates of failures to determine patterns in developer errors that may be traced to the development process. Was there incorrect or inadequate team communication? Ambiguous representations of code or requirements? Lack of redundancy, cross-checking, or oversight? Problems in training or documentation? Poor division of labor? Pressure to cut corners? Fatigue? Distraction? These are all characteristics of a process that any human factors engineer will recognize as contributing to human error. As characteristics of the process, they can be addressed by changing the process. Ultimately, what we want is a new way to engineer reliability into the software development process, to do for software what quality assurance has done for physical manufacturing.
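
Fittingly, the aggregate analysis itself can start very small. Here is a toy Python sketch of the idea; the failure records and root-cause categories are invented for illustration.

    from collections import Counter

    # Imagine each record came out of a post-mortem on one field failure.
    failures = [
        {"id": 101, "root_cause": "ambiguous requirement"},
        {"id": 102, "root_cause": "interface misunderstanding"},
        {"id": 103, "root_cause": "ambiguous requirement"},
        {"id": 104, "root_cause": "schedule pressure"},
        {"id": 105, "root_cause": "ambiguous requirement"},
    ]

    # Tally the categories to see which process weaknesses recur.
    for cause, count in Counter(r["root_cause"] for r in failures).most_common():
        print("%s: %d" % (cause, count))

If “ambiguous requirement” dominates the tally, the corrective action is a process change (clearer specs, requirements reviews), not exhortations to individual coders to fart less.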

At the very least, root cause analysis of software failures is necessary to know how to improve the development process. This can be done on an industry level (where it can be used to create a new programming language or other widely-used software), or within an organization (where it can be used to create new procedures and policies for coding, documenting, training, and communication), or even at an individual level, to improve your own coding.

In any case, the way things are, adequate usability will not be achieved without better reliability. An app designed for great usability does little good if it doesn’t perform in execution.

Summary Checklist

Problem: Failing, unreliable, unstable, defect-ridden software.

Potential Solutions:

  • Ameliorate the experience of failure for the end user by providing easier recovery.
    • Integrated and usable service-oriented technical support, knowledge bases, forums.
    • Feedback of information between support and development.
    • Automatic patching.
    • Failure recovery built into the app with automatic save, milestoning, automation overrides, and diagnostic features.
  • Apply human factors principles to redesign the development process to make it compatible with human limitations.
    • Consider training, tools, organization and communication, and the environment.
    • Initiate a continuous improvement process to track failures, identify their root causes, and develop strategies to prevent future failures.

One Response to “Designed to Fail”

  1. Mike Bachman says:

    Zee

    I have often thought about this problem. HW designers have fixed this issue quite well. It’s called the IC and the schematic. The IC is designed and tested and has a few or sometimes many features… but these features are FIXED and therefore can be readily tested and perfected… even an IC as complex as a DSP or dual-core has been fully run through tests… and these tests are standardized and fully vetted to catch bugs…

    ICs come with spec sheets, and you can look through the specs and pick the ones that work best. And these spec sheets conform to a basic standard.

    And then you put them all together in a schematic… something everyone can read and quickly understand.

    Why this doesn’t work for the SW world, IMHO, is that SW dudes like to mess with everything… they get a DLL or open source module that may well be 100% perfect… but they tweak it so it is “better”… A HW guy CAN’T tweak the IC; he uses it as is, with all its limitations… he just uses it. SW guys don’t do that… they change everything, and thus everything is suspect… and there has never been a good schematic for SW…

    Find these solutions and SW will become as easy as HW is now… think back 30 years to how hard it was to build a working computer… now it is on one chip… power it up and it works 100% perfectly… every time… Treat SW like HW and 20 years from now we will finally have SW that works…