A/B testing tells you what you should do differently

A/B testing is a statistical method and it can bring tangible business benefits. A/B testing measures the impact of a single change on the end result: what is worth doing differently?

“In A/B testing, it is important to know what is being done and to decide the objectives in advance, so that we will not come up with excuses and interpretations after testing,” says Timo Mikkolainen at Compile, who has many years of experience in creating and testing demanding projects. During his career, Timo has worked with organisations such as Nokia and the National Land Survey of Finland, and he took part in a large project transferring the driving licence authority from the Police to the Finnish Transport and Communications Agency. “It was a perfect example of a public IT project that no one ever heard of because nothing went wrong and the budget held,” Timo says. “For me, the transition to Compile was surprisingly easy: the ‘no bullshit’ philosophy suits me very well.”

A/B testing, what is it?

“You have something you want to measure: does one individual thing affect another?” says Timo. “There is something absolute that either happens or does not happen. The environment stays exactly the same in terms of time and place, but one variable is different. Will there be a statistically significant difference when the change is isolated?”

Of course, testing can be performed with several variables at once, but it is usually not worth it: there will be too many moving parts, and the sample size needs to be increased considerably. It is also rarely worth running a test for a long time. The sample must nevertheless be statistically sufficient, because otherwise there is no point in testing, and determining a sufficient sample size is not always easy.
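As a rough illustration of what a “statistically sufficient” sample means, here is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates. The function name, the 5% significance level and the 80% power are assumptions chosen for the example, not figures from Timo's projects.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate number of users needed in EACH variant (A and B).

    baseline_rate: current conversion rate, e.g. 0.10
    expected_rate: the rate we hope the change achieves, e.g. 0.12
    alpha, power:  assumed defaults (two-sided 5% test, 80% power)
    """
    if baseline_rate == expected_rate:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_power = NormalDist().inv_cdf(power)
    pooled = (baseline_rate + expected_rate) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_power * sqrt(
            baseline_rate * (1 - baseline_rate)
            + expected_rate * (1 - expected_rate)
        )
    ) ** 2
    return ceil(numerator / (expected_rate - baseline_rate) ** 2)


if __name__ == "__main__":
    # Detecting a lift from 10% to 12% needs several thousand users per variant.
    print(sample_size_per_variant(0.10, 0.12))
```

The smaller the difference you want to detect, the larger the sample: halving the expected lift roughly quadruples the number of users required.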

What is A/B testing used for?

“The fun part is to be able to allocate the right users to do the testing,” Timo says laughing. “A/B testing is typically used to help guide the user to do something – for example, make a purchase decision or find the help page to solve their problems as easily as possible.” The user is rarely a rational creature whose actions we could simply predict. A/B testing can also be used to study micro conversions. For example, can a notification in a mobile app move a customer a step closer to buying in an online store?
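One common way to allocate users to the two versions is to hash a user identifier, so that the same user always sees the same version. The sketch below is an assumption about how this could be done, not a description of Compile's tooling; `assign_variant`, the experiment name and the 50/50 split are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, share_b=0.5):
    """Deterministically place a user in variant "A" or "B".

    Hashing the experiment name together with the user id keeps a user's
    assignment stable within one experiment but independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "B" if bucket < share_b else "A"


if __name__ == "__main__":
    for user in ("user-1", "user-2", "user-3"):
        print(user, assign_variant(user, "checkout-notification"))
```

Because the assignment depends only on the identifier, no lookup table is needed, and the split stays close to the chosen share once enough users have arrived.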

“When you start experimenting with the differences, the most important thing is not to assume anything in advance,” Timo advises. “Let’s just try all sorts of things and see what works. This works, that doesn’t.” For example, the user may be interested in something completely different from what was initially assumed, and this can be determined by testing. Price, colour, speed?

What is the benefit of A/B testing?

The people in charge of the purse strings might think that such testing is going to be expensive. “A/B testing has tangible business benefits,” emphasises Timo. “It is therefore important that the testing team have been given a mandate from a high level. The ideal thing would be for the team to include both management and data analysis expertise.” This is especially important when testing is used to make corrections to existing processes, such as making UI changes. The correct data cannot be accessed without permission from a higher level.

“There’s no point in following the numbers if the numbers are wrong. When done right, a single test is not expensive in comparison to the potential benefits,” says Timo. “The tests usually aim for 95% certainty. If the result is statistically significant and positive, the variable under testing works. However, A/B testing is not only done for one thing. We are gathering information about customers. Preconceptions can be completely overturned.”
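To make the “95% certainty” concrete, here is a minimal sketch of a two-proportion z-test, one common way to check whether the difference between version A and version B is statistically significant. The function name and the example numbers are invented for illustration; the 5% threshold corresponds to the 95% level mentioned above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, users_a, conv_b, users_b):
    """Return (z, p_value) for the difference in conversion rate, B minus A.

    Uses the pooled normal approximation, which assumes reasonably large
    samples in both variants.
    """
    rate_a = conv_a / users_a
    rate_b = conv_b / users_b
    pooled = (conv_a + conv_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value


if __name__ == "__main__":
    # Hypothetical result: 400 of 4000 users convert in A, 480 of 4000 in B.
    z, p = two_proportion_z_test(400, 4000, 480, 4000)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"z = {z:.2f}, p = {p:.4f}, {verdict} at the 95% level")
```

A positive z with a p-value below 0.05 corresponds to the “statistically significant and positive” case in the quote. A negative z would mean the change made things worse.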

Continuous and resilient development

A/B testing is often continuous development: searching for problems and persistently converting bad figures into good ones by optimising the process. For the tests, the team comes up with several different ideas and calculates how long each test should be run. It is common to look for ideas by talking to people outside the team and, of course, by exploring how others have approached the same problem.
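Calculating how long a test should run can be as simple as dividing the required sample by the available traffic. The sketch below assumes the required sample per variant is already known. The function name, the traffic figures and the `traffic_share` parameter are hypothetical.

```python
from math import ceil


def estimate_duration_days(required_per_variant, daily_visitors,
                           variants=2, traffic_share=1.0):
    """Days needed until every variant has collected its required sample.

    traffic_share: fraction of daily visitors included in the experiment.
    """
    eligible_per_day = daily_visitors * traffic_share
    if eligible_per_day <= 0:
        raise ValueError("the experiment needs some daily traffic")
    return ceil(required_per_variant * variants / eligible_per_day)


if __name__ == "__main__":
    # 3841 users per variant, two variants, 1200 visitors a day
    print(estimate_duration_days(3841, 1200))  # 7
    # Same test with only half of the traffic in the experiment
    print(estimate_duration_days(3841, 1200, traffic_share=0.5))  # 13
```

In practice, teams often round the result up to whole weeks so that weekday and weekend behaviour are both represented. That is a rule of thumb, not something the calculation itself enforces.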

In e-commerce, for example, Amazon and other big online stores are good benchmarks. “Although I’d rather copy the good stuff than the dishonest methods, the dark patterns,” Timo points out. “When you look at the means that large online stores use, for example, they strongly direct the user to register instead of logging in: this does not affect the number of logins of old customers, but significantly increases the number of new customers registering.” Negative effects should also be taken into account, because one moving part can affect many things. All indicators must be monitored, at least superficially.

A/B testers are usually not part of the development team proper; they form their own separate task force. “Let’s be as agile as agile can be.” The results of the testing team’s work are used to set requirements for the actual development. “This doesn’t make anybody a coding hero, so your ego should be put aside. The code for testing is completely disposable,” says Timo.

The eNPS calculation is based on the Net Promoter Score (NPS) formula developed by Fred Reichheld, which was originally used to study the customer experience and customer satisfaction of companies. Lately, it has also been used to research employee satisfaction (e as in employee + NPS).

This is how the calculation is performed.

We ask our employees once a year, “How likely are you to recommend your workplace to friends or acquaintances on a scale of 0 to 10?” Then we ask for clarification with an open question: “Why did you submit this score?”

Those who submit a score of 9 or 10 are called promoters. Those who submit a score from 0 to 6 are called detractors.

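The eNPS result is calculated by subtracting the percentage of detractors from the percentage of promoters, with both percentages taken from all responses. Scores of 7 or 8 count toward the total number of responses but are neither added to nor subtracted from the result.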

The calculation results can be anything from -100 to +100. Results between +10 and +30 are considered to be good, and results above +50 are considered to be excellent.
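The calculation described above can be sketched in a few lines. The function name and the sample scores are invented for illustration.

```python
def enps(scores):
    """Employee Net Promoter Score from a list of 0-10 answers.

    Promoters score 9 or 10, detractors score 0 to 6. Scores of 7 or 8
    only count toward the total number of responses.
    """
    if not scores:
        raise ValueError("at least one response is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


if __name__ == "__main__":
    # 5 promoters, 3 detractors and 2 others out of 10 responses
    answers = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5]
    print(enps(answers))  # 20.0
```

With these sample answers, 50% promoters minus 30% detractors gives +20, which falls in the range the text describes as good.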