Unlike traditional test coverage (line, statement, branch, etc.), which only measures which code is executed by tests, mutation testing checks whether your tests can actually detect faults in that code; in other words, it checks whether each statement is meaningfully tested.
Mutation testing rests on two well-founded assumptions that make it possible to compute a mutation score, a measure of the quality of a test suite.
The first assumption is that the developer is competent: the code they write is correctly structured to do what it is supposed to do. The main source of error then becomes carelessness. For example, the developer omits a necessary line of code, or uses one operator instead of another.
The second assumption is the so-called coupling effect: a large error is essentially the combination of several small ones. Any large bug in a program can therefore be traced back to a few small slips of inattention by our otherwise competent developer.
Given these two hypotheses, we can simulate such careless mistakes by creating mutants of a unit of code: variants of that unit with a line removed or duplicated, an operator modified or deleted, and so on. Each mutant stands in for a careless mistake a developer might realistically make.
After generating a number of mutants, we run our test suite on each mutant instead of on the original unit, and measure the proportion of mutants on which at least one test fails (such mutants are said to be killed). This proportion is what we call the mutation score.
A good way to learn a new concept is to start from the simplest possible example. Consider a function that takes two numbers and returns their sum. We will write it in Python, but the idea applies to any language.
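A minimal Python version of that function:

```python
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
```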
We want to test this function robustly. If coverage is our metric, a single test asserting that add(0,0) equals 0 is enough to reach 100%.
The problem is that if a developer mistakenly replaces the + with a -, our test suite will never alert us, even though the function's behavior has completely changed: it now computes a difference instead of a sum. As we have seen, mutation testing consists of simulating this kind of fault before it occurs, in order to prevent it.
Let's create some mutants:
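The original listing is not reproduced here; the four mutants below are a plausible reconstruction, chosen so that they behave exactly as described in the paragraphs that follow (the exact mutants are an assumption).

```python
# Four hand-made mutants of add(a, b) -> a + b.
# Reconstructed to match the kill results discussed in the text;
# the specific variants are an assumption.

def mutant_1(a, b):
    return a - b      # arithmetic operator + replaced by -

def mutant_2(a, b):
    return a          # right operand dropped

def mutant_3(a, b):
    return b          # left operand dropped

def mutant_4(a, b):
    return a + b + 1  # off-by-one constant inserted
```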
Here are our 4 mutants. Notice that our test (assert add(0,0) == 0) passes on every mutant except the last one. So despite our coverage of 100%, the mutation score is only 25%, because only one of the four mutants has been killed.
If we add assert add(1,0) == 1, our test suite reaches a mutation score of 50%, because the 3rd mutant is killed but not the other two. Then we add assert add(0,1) == 1, and we reach 100%, because the 1st and 2nd mutants are killed as well.
Thus, while the one-assertion test suite (assert add(0,0) == 0) is perfect by the coverage standard, we need two more tests to make it perfect by the mutation testing standard with our basic mutant generator.
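The whole exercise can be sketched as a small script. This is a toy, not a real mutation tool: the mutant list mirrors the four variants discussed above, and a mutant counts as killed when at least one test fails on it.

```python
# Hypothetical mutants of add(a, b) -> a + b, mirroring the article's four.
mutants = [
    lambda a, b: a - b,      # + replaced by -
    lambda a, b: a,          # right operand dropped
    lambda a, b: b,          # left operand dropped
    lambda a, b: a + b + 1,  # off-by-one constant
]

# The article's three assertions, as (inputs, expected) pairs.
tests = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1)]

def mutation_score(mutants, tests):
    """Fraction of mutants on which at least one test fails (is 'killed')."""
    killed = sum(
        1 for mutant in mutants
        if any(mutant(*args) != expected for args, expected in tests)
    )
    return killed / len(mutants)

print(mutation_score(mutants, tests[:1]))  # 0.25: only the last mutant fails add(0, 0) == 0
print(mutation_score(mutants, tests))      # 1.0: all four mutants are killed
```

Running it with only the first test reproduces the 25% score from the article; with all three tests, the score reaches 100%.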
Even on this basic example, we can see the value of the mutation score as a measure of test suite quality, and of mutation testing in general.
A major flaw of mutation testing is that it is computationally very expensive. Each mutant must be executed against the test suite to know whether it is killed, which amounts to roughly one full run of the suite per mutant and can easily multiply the total execution time by a few thousand.
To mitigate this cost, “Predictive Mutation Testing” approaches try to infer whether the test suite would kill a mutant without ever running it, using statistical classification methods on features extracted from the dynamic trace of the tests on the original code.
With this method, we don't have to run the test suite thousands of times but only once. The downside is that we lose exact results, but state-of-the-art models reach an accuracy in the range of 70-80%.
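The predictive idea can be illustrated with a deliberately toy sketch. The feature names and the hand-set rule below are illustrative assumptions, not a real predictive mutation testing model: each mutant is described by cheap features gathered from a single run of the suite on the original code, and a stand-in classifier guesses whether it would be killed.

```python
# Toy sketch of predictive mutation testing (assumed features, not a real model):
# each mutant is summarized by features from one run of the suite on the
# original code, and a simple rule predicts whether it would be killed.

mutant_features = [
    # (times the mutated line is executed, is its value asserted directly)
    {"executions": 12, "asserted": True},
    {"executions": 3,  "asserted": False},
    {"executions": 0,  "asserted": False},  # never executed: cannot be killed
]

def predict_killed(features):
    # A mutant on a line no test executes can never be killed; beyond that,
    # this stand-in "model" simply bets on well-exercised, directly asserted code.
    if features["executions"] == 0:
        return False
    return features["asserted"] or features["executions"] >= 10

predictions = [predict_killed(f) for f in mutant_features]
print(predictions)  # [True, False, False]
```

A real approach would replace the hand-set rule with a trained classifier, but the shape of the problem is the same: features in, kill prediction out, no mutant execution required.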
Another problem is that classic mutants are too naive and do not represent realistic bugs. To address this, ongoing work aims to generate more realistic bugs using machine learning methods or static analysis.
The Developer Experience team at CircleCI leverages the power of mutation testing to assess the true quality of your tests. This allows us to go beyond simple coverage and explore the robustness of your test cases.
Want to know more about the topics our data scientists explore to accelerate the developer journey? Explore the Ponicode blog.
For fellow developers, we have more content about measuring your code quality.