Mutation Testing: The Gold Standard in Testing

Today on the blog, Hamza, Data Scientist at Ponicode, introduces us to Mutation Testing, often considered the gold standard against which all other types of coverage are measured.

Unlike traditional test coverage (line, statement, branch, etc.), which only measures which code is executed by the tests, mutation testing checks that your tests can actually detect faults in the code they execute. In other words, it tells you whether each statement is meaningfully tested.


Mutation testing is interesting because it rests on two well-thought-out assumptions. Together, they justify using the mutation score as a measure of the quality of a test suite.

The Competent Developer

The first assumption of mutation testing is that the developer is competent. That is, the code they write is correctly structured to do what it should do. The main source of error then becomes carelessness on the part of the developer. For example, they might omit a necessary line of code, or use one operator instead of another.

The Coupling Effect

The second assumption is the so-called coupling effect. It claims that a large error is essentially a combination of several small errors, so a test suite that detects the small errors will also detect the large ones. Thus, any large bug in a program can be explained by several small errors of inattention on the part of our competent developer.

Given these two hypotheses, we can simulate these “errors of inattention” by creating mutants of the unit under test: variants of the unit with missing or duplicated lines, modified or deleted operators, etc. These mutants are meant to simulate the careless mistakes that a developer might make in the program.
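To make the idea of a "modified operator" concrete, here is a minimal sketch of how a mutant could be generated automatically with Python's standard `ast` module. The function `total_price` and the helper names are hypothetical and only serve as an illustration; real mutation tools apply many more kinds of mutation than this single swap.

```python
import ast

# Hypothetical unit under test, used only for illustration.
SOURCE = """
def total_price(price, shipping):
    return price + shipping
"""


class AddToSub(ast.NodeTransformer):
    """Replace every binary `+` with `-` to simulate an operator slip."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # handle nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def load_function(tree, name):
    """Compile a module AST and return the named function defined in it."""
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, filename="<mutant>", mode="exec"), namespace)
    return namespace[name]


original = load_function(ast.parse(SOURCE), "total_price")
mutant = load_function(AddToSub().visit(ast.parse(SOURCE)), "total_price")

print(original(10, 2))  # 12
print(mutant(10, 2))    # 8
```

Working on the syntax tree rather than on raw text means the mutation only touches real operators, not a `+` that happens to appear in a string or a comment.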

After generating a certain number of mutants, we run our test suite on each mutant rather than on the original unit, and look at the proportion of mutants on which our tests fail. A mutant on which at least one test fails is said to be “killed”. The proportion of killed mutants is what we call “the mutation score”.
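Written as a formula, where the symbols $K$ and $M$ are introduced here purely for illustration: $M$ is the number of mutants generated and $K$ is the number of mutants on which at least one test fails.

```latex
\text{mutation score} = \frac{K}{M}
```

A score of 1 (100%) means every mutant made at least one test fail; a score of 0 means the test suite noticed none of them.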

A Quick Example

A good way to learn a new concept is to use the simplest example possible. To do this, consider a function that takes two numbers and returns their sum. Let's write it in Python, but the idea applies to any language.

def add(a, b):
    return a + b

We want to test this function robustly. If coverage is our metric, then a single test checking that add(0,0) is equal to 0 is enough to reach 100% coverage.

The problem is that if a developer makes the mistake of replacing the + with a -, our test suite will never alert us, even though the behavior of the function will have completely changed (difference instead of sum). As we have seen, mutation testing consists of simulating this kind of mistake before it occurs, in order to prevent it.

Let's create some mutants:

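# Mutant 1: "+" replaced with "-"
def add(a, b):
    return a - b

# Mutant 2: second operand dropped
def add(a, b):
    return a

# Mutant 3: first operand dropped
def add(a, b):
    return b

# Mutant 4: return value replaced with None
def add(a, b):
    return None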

Here are our 4 mutants. We notice that our test (assert add(0,0) == 0) passes for all mutants except the last one. This means that despite our coverage of 100%, the mutation score is only 25% because only one mutant has been killed among 4.

If we add assert add(1,0) == 1, then our test suite will have a mutation score of 50%, because the 3rd mutant will be killed but not the first two. If we then add assert add(0,1) == 1, we reach 100%, because the 1st and the 2nd mutants are killed as well.

Thus, while the test suite:

assert add(0,0) == 0

is perfect according to the coverage standard, we needed to add two more tests (three in total) to make it perfect according to the mutation testing standard with our basic mutant generator.
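The scores above can be checked with a small script. This is only a sketch of the bookkeeping, not how a real mutation testing tool is built: the mutants are written by hand, each test is expressed as a function that receives the implementation to check, and the names `MUTANTS`, `TESTS` and `mutation_score` are made up for this illustration.

```python
def add(a, b):
    return a + b


# The four hand-written mutants from the example.
def mutant_minus(a, b):
    return a - b


def mutant_first(a, b):
    return a


def mutant_second(a, b):
    return b


def mutant_none(a, b):
    return None


MUTANTS = [mutant_minus, mutant_first, mutant_second, mutant_none]

# Each test receives the function under test and returns True if it passes.
TESTS = [
    lambda f: f(0, 0) == 0,
    lambda f: f(1, 0) == 1,
    lambda f: f(0, 1) == 1,
]


def mutation_score(mutants, tests):
    """Fraction of mutants on which at least one test fails."""
    killed = sum(1 for m in mutants if not all(t(m) for t in tests))
    return killed / len(mutants)


# The original implementation must pass every test.
assert all(t(add) for t in TESTS)

# Score with the first test only, then the first two, then all three.
for n in range(1, len(TESTS) + 1):
    print(n, mutation_score(MUTANTS, TESTS[:n]))
# 1 0.25
# 2 0.5
# 3 1.0
```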

This basic example shows the value of the mutation score as a measure of the quality of a test suite, and of mutation testing in general.

Going Further

A major drawback of mutation testing is that it is computationally very expensive. Every mutant has to be executed against the test suite to know whether it is killed, so the total cost grows roughly in proportion to the number of mutants. This can easily multiply the execution time of the test suite by a few thousand.

To address this problem, there are so-called “Predictive Mutation Testing” approaches. They use statistical classification methods to infer whether a mutant will be killed by the test suite, based on the dynamic trace of the tests on the original code.

With this method, we only have to run the test suite once instead of thousands of times. The downside is that the result is a prediction and is no longer 100% accurate: state-of-the-art models have an accuracy in the range of 70-80%.

Another problem is that mutants are often too naive and do not represent realistic bugs. To address this, work is being done to generate more realistic bugs through machine learning methods or static analysis.

Ponicode & Mutation Testing

With Ponicode, you don’t need to worry about setting up mutation testing yourself: it is one of the techniques we use to assess the true quality of your tests. This lets us go beyond simple coverage and check that your test cases are robust.

To find out more about Ponicode for Data Science, click here. If you want to find out more about when you should be testing in data science, you can check out Hamza's other article here 🦄