The Testing of All The Testing

We’ve come so far. Probably too far.

The beginning for me was 1999. After a year spent designing web things with no discernible business value (or worse, with negative impact), I tried something to be safe. I ran the next big homepage redesign as an actual A/B test against our control. We did it the old-fashioned way: proxy server, log files and spreadsheets. And when that new homepage won, when it proved to be statistically different and better, I thought I’d beaten the system. It didn’t feel like experimenting or gambling. It felt like I was ten years old and I’d hit a fastball as far as I could. I wanted to keep having that feeling for the rest of my life.

And so, for the next two decades, I increasingly dedicated my career to design experimentation. That journey was supercharged around 2012, shortly after the publication of “The Lean Startup” and the birth of Optimizely, when I started Clearhead, a company dedicated to experimentation in UX and Product design. For the better part of a decade, and through thousands of experiments, I became an increasingly vocal and visible proponent of continuous experimentation as a design process. I was convinced that design and experimentation were actually one and the same, but that the most successful companies took the time to articulate the hypothesis and test its validity, whereas everyone else just experimented without the rigor or the benefits. I became so consumed by this thinking that I was one of a growing number of voices shouting “test everything!”

The truth and untruth of “test everything”

And yet, almost as soon as I was converted, I suspected that there was something not quite right with our lovable cult and our clickable slogans. The first pangs of dissent began in response to the endless march of PowerPoint case studies boasting high double-digit (even triple-digit!) performance improvements. While I’d seen marked statistical difference on lower-traffic landing pages, I’d never seen results of those proportions carry down to a bottom line. And, candidly, the best results I’d ever seen tended to regress after a short reign of supremacy. I sensed there was something “snake oily” about these case studies, but I soldiered on, chalking them up to exuberance.

Next, I began to hear loyalists saying that every experiment — win or lose — was a winner because the learning that comes from every experiment carries great, intrinsic value. As an entrepreneur and business owner, this claim pressed my gag reflex. I knew all too well the value of time and resources. I knew that opportunity costs are disproportionately high for a startup. I also knew that, for most A/B tests, businesses tended to look simply for winners and losers and less for why, how and what next. In short, I was always disappointed at how little people learned from their experiments. But mostly, I knew that not all experiments were equally valuable. And that most businesses were testing based on the charm of the hypothesis rather than the impact of the problem. Without sufficient understanding of the root problem, I wondered how all experiments could be (equally) valuable. Wouldn’t it be more valuable to test against the problems most worth solving rather than the hypotheses that cost the least but appeared the most “different”? I started to quietly call “bullshit.”

Finally, there was that word: “everything.” Did we really mean to test everything? Every piece of content? Every design? Every feature? Every product? Every service? I was convinced that every business decision, from floor to ceiling, was an experiment. But was I equally convinced that everything should be tested? The short answer was: “yes.” The longer and more complicated answer was: “No, not every change should be A/B tested. That’s impossible. That’s ridiculous.” Over time, I found myself at odds with the snowballing trends. Test this. Optimize that. More. Faster. Rinse. Repeat. My ambivalence had very little to do with my passion for experimentation and the gathering of evidence. It had almost everything to do with that single word: “test.” Unbeknownst to me, by 2015, the words “test” and “experiment” had become increasingly synonymous. A new generation of digital practitioners had kidnapped the word “test” and conflated it with the controlled “A/B test” or “multivariate test,” as though there were no other means for testing a design or product or concept. Further, the purpose of controlled experimentation (to validate and/or learn) had been eclipsed by a blinding desire to chase the upside and protect against the downside.

I’m not sure exactly when it happened, but at some point, the phrase “test everything” came to mean “run as many A/B tests as possible, in the browser/client, manipulating the DOM through SaaS, in pursuit of maximum improvement to conversion rates.” This was more than a trend. It was more than a mantra. It sounded like settled fact. And yet, it was so obviously illogical and such a disservice both to the practice of testing and the sub-practice of experimentation. What did this phenomenon imply about usability testing? What did it imply about surveys? What did it imply about market testing? What did it imply about focus groups? What did it imply about tree testing or card sorting? What did it imply about server-side testing? My questions were mounting but, amidst the SaaSy elation, it felt like nobody was interested in answers.

A square is always a rectangle

Forgive me, but maybe let’s get definitional for a moment:

Test /test/

a procedure intended to establish the quality, performance, or reliability of something, especially before it is taken into widespread use.

Experiment /ikˈsperəmənt/

a scientific procedure undertaken to make a discovery, test a hypothesis or demonstrate a known fact.

So, while similar, a test and an experiment are not the same thing. Most digital experiments function as tests, but the reverse is frequently untrue. In fact, I frequently see businesses employ A/B testing not to validate a hypothesis (often the hypothesis is not clearly stated) but simply to limit the risk of failure or to “win.” In these cases, the experiment is more a test or an insurance plan or a contest than it is an actual experiment. On the other hand, when people conduct usability tests or focus groups or surveys, they are almost uniformly testing a product, service or concept, but they are not actually running an experiment. In either case, the methodologies can be sound or biased. And my point is less about the supremacy of one methodology over another and more about the increasingly lazy and mechanized thinking that has infected the practice of experimentation. Over the years, I’ve seen too many businesses deliriously swinging blunt hammers at a set of nuts and bolts.

There are many ways to test. They each have different benefits and costs. There are also many ways, though necessarily fewer, to experiment. So, why am I being such a crotchety purist about this distinction? Because, a decade after The Lean Startup and the second wave of experimentation software arrived, we’ve lost the thread. And, in doing so, we are wasting a lot of time and money. We are being the very opposite of Lean. 

The cost of confidence

To state the obvious: A/B testing is functionally the highest confidence, most statistically rigorous method for determining the validity of our hypotheses. When done properly, it can be both a test and an experiment. And, over the last decade, the time and cost to construct an A/B test has come down considerably. However, it is not the only method for testing things and, frequently, it is not the best. In conflating A/B testing with every other method of testing, we have commingled the potential for highest confidence (when, in fact, most experiments don’t demonstrate a high-confidence statistical difference) with every other conceivable testing benefit. The effect of this has been: “why bother?” Why bother with surveys? The A/B test will yield results. Why gather a whole bunch of user tests? Let’s just do one or two and then A/B test. Why build something that’s market ready and roll it out gradually? Instead, let’s code something “good enough” in the client and then rebuild it in production if the variant wins.

While A/B testing does, in many cases, deliver high confidence results, it also — at least as I often see it employed — lags behind other testing methods in fundamental ways. A/B testing requires clear hypotheses. The clarity of that hypothesis often obscures other potential generative insights. Surveys, focus groups and user tests are often better at quickly getting to the “why” of a question than A/B tests. Similarly, it is easier to extract new or misunderstood problems or unconsidered hypotheses through other forms of testing. And, while A/B tests are easier to deploy now than they ever have been, that is not the same thing as being “easy.” And certainly not the same thing as being “cheap.” 

Though success or flat performance is less expensive than a failure, A/B tests, when executed with care, require design resources, and engineering resources, and analytics resources and, in all likelihood, research, project management, product management and, possibly, marketing resources. That’s a lot of human resources. Additionally, while A/B tests can be quick to construct, they can require long durations and large numbers of users to engender confidence (much less statistical difference). As a result of the relatively high cost of experimentation (compared to other methods), many businesses suffer from the bias of sunk costs. They call “winners” without sufficient evidence, they push the variant regardless of results or they define winning and validation simply as “not losing.”
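To put rough numbers on the "large numbers of users" point, here is a minimal sketch of a standard sample-size approximation for comparing two conversion rates. Everything specific in it is an assumption for illustration: the function name users_per_arm, the 3% baseline, the 5% significance level and the 80% power are not taken from any particular tool or test.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in EACH arm of a two-sided, two-proportion test.

    Uses the common normal approximation:
    n = (z_alpha/2 + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate we hope the variant reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical 3% baseline conversion rate; smaller lifts need far more users.
    for lift in (0.20, 0.10, 0.05):
        n = users_per_arm(0.03, lift)
        print(f"{lift:.0%} relative lift: ~{n:,} users per arm")
```

Under these assumptions, detecting a 10% relative lift on a 3% baseline takes on the order of fifty thousand users per arm, and halving the lift you want to detect roughly quadruples that requirement. On a lower-traffic page, that is where the long durations come from.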

Perhaps my biggest peeve with A/B testing, however, is the bluntness of the metric frequently used to test hypotheses. Conversion rate: that elusive holy grail of so many experiments. While I have always been obsessed with conversion rate as a metric, I’ve been more obsessed with the quality of experiences and the cogency of products. Conversion rate is the most popular metric selected for A/B tests and, yet, conversion is often an event that occurs some distance from the hypothesis, at a relatively low frequency, and based on numerous factors outside the control of the experiment. My interests are generally more elemental: are users able to get to where they want to get to and do what they want to do? Why or why not? I’ve always suspected that, if so, and if the message, product and price are desirable, conversion rate increases will follow experience and product optimization. Inversely, I’ve long said that I have two virtually foolproof methods for optimizing conversion rate: (1) dramatically reduce prices or (2) dramatically decrease traffic. I joke, but only slightly. If testing is as much about learning as it is optimizing, then A/B testing for conversion rate, when the hypothesis is actually about some other event or action, tends to diminish the benefit of the former while rarely attaining gains on the latter.
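The "decrease traffic" half of that joke is easy to see with a little arithmetic. The sketch below uses entirely hypothetical traffic segments and order counts (none of these figures come from a real site) to show how switching off low-intent traffic raises conversion rate while total orders fall.

```python
def summarize(segments):
    """Return (visitors, orders, conversion_rate) across the given segments."""
    visitors = sum(v for v, _ in segments.values())
    orders = sum(o for _, o in segments.values())
    return visitors, orders, orders / visitors


# Hypothetical segments: name -> (visitors, orders). Illustrative numbers only.
all_traffic = {
    "high_intent": (10_000, 500),
    "low_intent": (40_000, 400),
}
# "Optimization" by simply turning off the low-intent traffic source.
reduced_traffic = {"high_intent": all_traffic["high_intent"]}

before = summarize(all_traffic)
after = summarize(reduced_traffic)

for label, (visitors, orders, rate) in (("before", before), ("after", after)):
    print(f"{label}: {visitors:,} visitors, {orders:,} orders, {rate:.1%} conversion")
```

With these made-up numbers, conversion rate climbs from 1.8% to 5.0% while orders drop from 900 to 500. The metric improved and the business got worse, which is why conversion rate on its own makes a blunt test of most hypotheses.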

For the love of testing

I could go on (and on) about the ways and reasons why we overuse and misuse A/B testing. But, the truth is that I love A/B testing. I love the method. I love its benefits. I love the technologies that enable it. I love that we’ve invested in the practice. But, I also love other testing methods. And I love learning and listening for growth almost as much as I do witnessing the benefits of a controlled experiment. Moreover, as a lifelong entrepreneur, I despise waste and inefficiency. I take opportunity cost seriously. And, ultimately, I want to understand “why” things might work or not work as much as I want to chase confidence intervals.

I was so delighted last year when Nitzan, Janet and the WEVO team invited me to serve their company, in part, because they were addressing many of the problems I am describing here. They were obsessed with the “why” precisely as much as the “what.” They were chasing the eye-opening benefits of qualitative testing with the rigor and objectivity of more quantitative methods. If you consider the distance between running a few user tests and constructing an effective A/B test, there’s a whole lot of ground that was left unaddressed. With WEVO, I saw a new way to support learning and decisions without the time, cost and effort of A/B testing but with more rigor and confidence than other usability products in the market. More confidence. More “why.” Less effort. Less bias. It’s not the same thing as A/B testing. But it’s much more toolbox than hammer.

Ultimately, what landed with WEVO for me was a rethinking of why we test, how we learn and what we define as our burdens of proof. And whether or not we employ WEVO or other user testing or A/B testing tools (and I would hope we can leverage all three), I think it’s just as important to ask ourselves these ten questions:

  1. What are we actually trying to learn?
  2. How can we best achieve that learning?
  3. How much evidence do we need to gather?
  4. What is the best indicator of success or failure?
  5. How long are we willing to wait to come to a conclusion?
  6. How many resources are we willing to spend on the test?
  7. Do we know what problem we are trying to solve?
  8. Do we have a clearly stated hypothesis?
  9. If our test fails or is inconclusive, what will we do next?
  10. If our test is successful, what will we do next?

Growth is achievable through continuous testing. I believe that. In fact, I know it. But you have to actually listen for growth. And you have to tune in properly. Sustained growth does not actually arrive (or last) through A/B testing for conversion rate. That’s the final mile — the validation of hypotheses derived through research, learning and design. Healthy growth is actually born from many forms of research and testing that help us identify root problems and clearer hypotheses which, in turn, help us design and build better experiences and products which, finally, through experimentation, yield increasingly better, and more confident, outcomes. By all means, test everything. And, yes, every new thing we try — whether we test it or not — is an experiment. But when we say “test and learn” let’s mean what we say and say what we mean.

Board Advisor, WEVO

Matty Wishnow is a serial entrepreneur and a member of the Board of Advisors at WEVO. Previously he was the founder of Insound.com and Clearhead as well as the Managing Director of Design at Accenture Interactive. Matty is curious about advanced baseball statistics, good design, music made in 1977, music made before (and after) 1977, vinyl records, great storytellers & middle aging. He lives in Austin, TX and Waitsfield, VT with his wife and three children.
