The first article you should read on how to measure your unit tests

Humans like measuring things. It reassures us. It shows us where we are good and where we are bad. It gives us direction for improvement. In particular, when we are driving a project, a team, a company, numbers provide a pillar upon which to base decisions. 

As an ever-learning software engineer and a member of Ponicode, I spend more time than one probably should thinking about unit tests, and, of course, this means also thinking about how to measure them.

But what does it mean "to measure unit tests"? What are we trying to measure?

To answer this question, we must first ask ourselves why we unit test. I believe the answer is that we unit test to be confident that our code does what it is expected to do. For the developer, the incentive is of course to produce a quality product, but it is also the promise that they won’t have to go back a few days, weeks, or months later to fix their code to accommodate an edge case they hadn’t thought about.

For a company, the incentive is… well, it’s always money, isn’t it? The cost of users finding bugs in production can be astronomical, therefore we must do all that can be done to identify problems earlier in the development cycle.

When measuring unit tests, we’re trying to measure how much confidence we gain in our code from carrying out a unit test. 

The first and best-known method for measuring unit tests is without a doubt COVERAGE.

Code coverage is an indicator of what percentage of a function, a file or a project’s code is executed when one or more tests are run. 

Code coverage is so widespread as a metric for unit tests that it is integrated in most test runners, appears on most dashboards, and is thrown around as a buzzword whenever one talks about code quality. 

One of the main advantages of code coverage is that it’s relatively quick and easy to calculate, even for large projects. It’s sufficient to run all unit tests to retrieve a nicely packaged code coverage report.

The calculation is generally split into three: line coverage, branch coverage and statement coverage. 

  • Line coverage can be said to be the least smart of the three, being satisfied with telling us the percentage of lines of code executed. It doesn’t really look at what’s inside a line, it simply ticks a box when your program passes through it. While a bit simple, line coverage is very useful when starting to write tests, to spot exactly which portions of code are completely untested.
  • Statement coverage is very similar to line coverage, except that it is able to detect when there are multiple statements on the same line.
    For example, `const age = 0; console.log(age);` is one line but two statements. As you can guess, in most situations this is not a game changer.
  • Branch coverage is probably the most interesting of the three: in a piece of code that has lots of ifs, loops and other conditional statements, branch coverage highlights whether our tests have entered all of the available branches. This is useful in situations such as ternaries. Given an expression such as `n > 0 ? "some" : "none"`, statement and line coverage would consider it covered as soon as it has run once, whichever side was taken. Branch coverage, on the other hand, is able to highlight that my tests only ever exercise the situation where `n > 0`.
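To make the ternary case concrete, here is a minimal sketch. The function name `describeCount` and the hand-written check are illustrative assumptions, not part of any coverage tool.

```javascript
// Hypothetical function built around the ternary from the text.
function describeCount(n) {
  return n > 0 ? "some" : "none";
}

// A single check that only exercises the `n > 0` side of the ternary.
// Line and statement coverage would count `describeCount` as fully covered,
// because its only statement has run. Branch coverage would flag that the
// "none" branch has never been taken.
const result = describeCount(3);
if (result !== "some") {
  throw new Error(`expected "some", got "${result}"`);
}
```

Adding a second check with `n <= 0` (for example `describeCount(0)`) is what would bring the branch figure up to match the other two.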

These three dimensions of coverage appear to provide a pretty comprehensive overview of the health of our tests. In reality, though, code coverage tells us more about what we don’t test than about what we do test and how well we test it. Let me explain.

If, for a given function, I obtain a report of 60% code coverage (be it line, statement, or branch, it doesn’t matter), I can be completely sure that there is some code in my function that hasn’t been executed at all. This is precious information, and pretty actionable: the minimum I can do is go write a test that executes the leftover lines or branches.

If, instead, I have 100% code coverage (let’s assume all 3 kinds!), all I can deduce is that every part of my code was executed by at least one test. But I have no additional information on whether my tests are any good, or on whether I can rely on them to know that my code is going to behave how I’d like it to. Let’s take a few practical examples:

1. Code coverage is unaware of use cases
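The sketch below is a hypothetical stand-in for the kind of function and tests discussed in this point. It assumes the misspelling sits in the function itself and was copied into the test, and it uses a small `check` helper (also an assumption) in place of a test runner so that it runs on its own.

```javascript
// Hypothetical pricing function; the typo "freind" is deliberate.
function getPricePerUnit(quantity, customerType) {
  let price = 10;
  if (customerType === "freind") price = 8; // BUG: should be "friend"
  if (quantity > 30) price = 8;
  if (quantity > 50) price = 6;
  return price;
}

// Tiny stand-in for a test runner's assertion, so the sketch is self-contained.
function check(description, actual, expected) {
  if (actual !== expected) {
    throw new Error(`${description}: expected ${expected}, got ${actual}`);
  }
}

// Three checks that together take every `if` in both directions,
// so line, statement and branch coverage are all complete.
check("friend, small order", getPricePerUnit(10, "freind"), 8); // typo copied from the source
check("standard, medium order", getPricePerUnit(40, "standard"), 8);
check("standard, large order", getPricePerUnit(60, "standard"), 6);

// What a caller using the correct spelling actually gets: 10, no discount.
const priceForRealFriend = getPricePerUnit(10, "friend");
```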



The three test cases above provide me with an all-round code coverage of 100% for the function `getPricePerUnit`. If I trust it, I can commit and deploy my code.

When a friend comes to visit and I need to know the price of an item for him, however, I will end up disappointed that no discount at all will be applied. Why? Because I made a mistake on the spelling of “friend”, and so my use case of a friend's price was never verified. 

2. Code coverage doesn’t care about what you assert
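The sketch below is a hypothetical stand-in for the function and test discussed in this point. The body of `getNumberType` is an assumption based on the three strings it is described as returning, and `expectTruthy` is a small helper that mimics what Jest's `expect(value).toBeTruthy()` checks.

```javascript
// Hypothetical implementation of the function discussed in this section.
function getNumberType(n) {
  if (n > 0) return "positive";
  if (n < 0) return "negative";
  return "neutral";
}

// Minimal stand-in for a truthiness assertion, so the sketch runs without a test runner.
function expectTruthy(value) {
  if (!value) {
    throw new Error(`expected a truthy value, got ${value}`);
  }
}

// Every line and branch runs, so coverage is complete. Yet any non-empty
// string, or any non-zero number, would satisfy these checks just as well.
expectTruthy(getNumberType(5));
expectTruthy(getNumberType(-5));
expectTruthy(getNumberType(0));
```

A stricter version would compare against the exact expected string, for example with Jest's `toBe("positive")`.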



Once again, we have a test that gives us 100% code coverage. But what are we really testing? Here we are just making sure that our function returns a truthy value, which doesn’t give us any confidence in the fact that the returned value is exactly the one that we need. 

Code coverage only checks that somewhere in our test suite, the function has been called. What we do with the output of this call is completely up to us. So much so that if we didn’t write an expectation at all, the code coverage report would still tell us we’ve hit the max!

3. Code coverage doesn’t reward us for having more than one test
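The sketch below is a hypothetical stand-in for the function and test discussed in this point. The body of `double` is an assumption chosen so that a single, carefully picked input hides the bug.

```javascript
// Hypothetical "fake" implementation: it squares its input instead of doubling it.
function double(n) {
  return n * n;
}

// The only check in the suite. It passes, and coverage is complete.
const onlyResult = double(2);
if (onlyResult !== 4) {
  throw new Error(`expected 4, got ${onlyResult}`);
}

// A second input is enough to expose the problem: double(3) should be 6.
const secondResult = double(3);
const secondCheckPasses = secondResult === 6; // false, because 3 * 3 is 9
```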



As you see in this example, writing a fake test is not as hard as we’d like to think. If we trusted code coverage, we’d say that the `double` function above is perfect: 100% of lines, statements and branches have been covered. And - yippee - with only one test! Amazing! Except that, if I tried to add another test to my suite, I would quickly realise that my function is actually very far from calculating the double of a number.

If we always wrote the minimum number of tests necessary to achieve full code coverage of a function, more often than not we would leave relevant and realistic input scenarios untested.

In summary, code coverage is very good at telling us which portions of code haven’t even seen the shadow of a test, and its success owes a lot to the fact that it’s easy to automate and fast to run, even on a large project. However, if we need a good measure of the quality of our existing tests, relying on coverage alone would likely result in more than a few nasty surprises.

An approach to measuring unit tests that we can turn to when code coverage fails is USE CASE COVERAGE, which is also sometimes called TEST COVERAGE. Use case coverage is a technique that requires identifying the different functional scenarios that a function could need to respond to. According to this approach, the necessary and sufficient number of test cases to guarantee full “coverage” of a portion of code is given by the list of acceptance criteria the code is supposed to meet.

If we take the function `getPricePerUnit`, which we saw above, its specifications would go a little bit like this:

  • For purchases of more than 50 items, standard customers pay 6 euro per item
  • For purchases of between 31 and 50 items, standard customers pay 8 euro per item
  • For purchases of between 0 and 30 items, standard customers pay 10 euro per item
  • For purchases of more than 50 items, friends pay 6 euro per item
  • For purchases of between 31 and 50 items, friends pay 8 euro per item
  • For purchases of between 0 and 30 items, friends pay 8 euro per item

This tells us that to achieve full use case coverage we need at least 6 tests - each of them corresponding to one functional use case.
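As a sketch, the six acceptance criteria can be written down as a table of cases, one per criterion. The implementation shown here and the sample quantities (51, 40, 10) are assumptions picked to fall inside each range.

```javascript
// Hypothetical implementation that follows the six acceptance criteria.
function getPricePerUnit(quantity, customerType) {
  if (quantity > 50) return 6;
  if (quantity > 30) return 8;
  return customerType === "friend" ? 8 : 10;
}

// One entry per acceptance criterion, regardless of how the code is structured.
const useCases = [
  { customerType: "standard", quantity: 51, expected: 6 },
  { customerType: "standard", quantity: 40, expected: 8 },
  { customerType: "standard", quantity: 10, expected: 10 },
  { customerType: "friend", quantity: 51, expected: 6 },
  { customerType: "friend", quantity: 40, expected: 8 },
  { customerType: "friend", quantity: 10, expected: 8 },
];

// Collect every use case whose actual price differs from the spec.
const failures = useCases.filter(
  ({ customerType, quantity, expected }) =>
    getPricePerUnit(quantity, customerType) !== expected
);

if (failures.length > 0) {
  throw new Error(`unmet use cases: ${JSON.stringify(failures)}`);
}
```

In a Jest suite, the same table could be fed to `test.each` so that every criterion shows up as its own named test.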

We see straight away that some of these tests could be considered redundant if we simply considered code coverage: for instance, we could argue that the price is the same for friends and standard customers when the quantity is highest. So why test twice? 

The reason is simple: use case coverage doesn’t care much about how my function is written. Its purpose is to ensure that the code supports all scenarios it’s intended to support according to the specs, and that it continues to do so even if the code is refactored.

100% use case coverage is achieved when we have placed enough tests to guarantee that our code does exactly what it is supposed to do in all use cases. Does this sentence sound familiar? It’s pretty much how we’ve defined our goal in measuring unit tests.

But if it’s so amazing, why haven’t we heard more about it then? Well, use case coverage has two pitfalls: 

  1. It requires either specific knowledge of the functional requirements of a function or portion of code, or a list of specifications so precise that it truly covers all the possible uses of a function. In the ideal world, this would always be the case. Too often, however, it isn’t.
  2. The calculation of use case coverage cannot be automated. There is no magic formula or script that can whip up a score for our function, file or project. It’s up to the developer to manually compare their test suite to the specs or to what they know the function should do in every situation.

Use case coverage is probably the best possible measure of the quality of our unit tests, but sadly it is not the most practical one in most situations. Exploiting it at scale is effectively impossible: no tool can automate what only expert knowledge or well-prepared specs can provide.

Finally, a third possible measure of the quality of our unit tests is MUTATION SCORE.

Mutation testing is a technique which involves creating several copies of a function or program, each featuring one small modification. Each modified version is called a mutant.

The theory goes that when an existing test suite passes for the original code and also passes for the mutant, it means that the test cases in the suite are not granular or precise enough, and so their quality is low. On the other hand, when the test suite fails for the mutant, we say that the suite has killed the mutant - meaning that it has done what it was supposed to do.

The mutation score corresponds to the percentage of mutants that the test or test suite was able to kill. A 100% mutation score means our function is very well covered, and therefore we can be pretty confident that it does what it’s supposed to do. Conversely, a low score means that we shouldn’t rely too much on our tests.
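The calculation itself is simple arithmetic. A minimal sketch follows, in which the function name `mutationScore` and the numbers are purely illustrative.

```javascript
// Mutation score as described above: the share of mutants killed, as a percentage.
function mutationScore(killedMutants, totalMutants) {
  if (totalMutants === 0) {
    throw new Error("no mutants were generated, so the score is undefined");
  }
  return (killedMutants / totalMutants) * 100;
}

// Illustrative numbers only: 18 of 24 mutants killed gives a score of 75.
const score = mutationScore(18, 24);
```

Real tools may treat some categories of mutants differently, for example those that time out or fail to compile, so their reported figure can differ slightly from this plain ratio.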

The great thing about mutation testing is that it’s not just analytical, but prescriptive too: the list of mutants that our suite was not able to kill is an excellent to-do list for improving our unit tests and augmenting them where needed.

Let’s take a look once again at the function `getNumberType`, which we very poorly tested with a series of `toBeTruthy` assertions.



If we decided to completely replace (mutate!) the strings “positive”, “neutral” and “negative” with, for instance, some positive numbers, we would still find that all of our tests pass. Mutation testing maintains that tests which still pass despite a significant change in the source code are not a positive sign (however tempting it is to be happy when we see green everywhere), but an indication that the quality of our tests is low. In this example, I believe we can all agree this is a fair assessment.
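Here is a minimal sketch of that experiment, done by hand rather than with a mutation testing tool. The body of `getNumberType`, the mutant and the way suites are modelled as functions are all assumptions made for illustration.

```javascript
// Hypothetical original function.
function getNumberType(n) {
  if (n > 0) return "positive";
  if (n < 0) return "negative";
  return "neutral";
}

// Mutant: the three strings are replaced with positive numbers.
function getNumberTypeMutant(n) {
  if (n > 0) return 1;
  if (n < 0) return 2;
  return 3;
}

// A suite is modelled as a function that returns true when all of its checks pass.
// The weak suite only checks truthiness; the strict suite checks exact values.
const weakSuite = (fn) => [5, -5, 0].every((n) => Boolean(fn(n)));
const strictSuite = (fn) =>
  fn(5) === "positive" && fn(-5) === "negative" && fn(0) === "neutral";

// A mutant is killed when a suite passes on the original but fails on the mutant.
const isKilled = (suite, original, mutant) => suite(original) && !suite(mutant);

const killedByWeak = isKilled(weakSuite, getNumberType, getNumberTypeMutant); // false: the mutant survives
const killedByStrict = isKilled(strictSuite, getNumberType, getNumberTypeMutant); // true: the mutant is killed
```

The surviving mutant points directly at the fix: assert on the exact returned string rather than on truthiness.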

Effectively, mutation score is nothing but an empirical method for trying to assess the quality of unit tests. It is not infallible either: it all depends on the quality of the mutations that we decide to apply. Creating several intelligent mutations for each function we test might seem like a complicated and time consuming process, but thankfully some tools exist that do this for us, and provide us directly with a score and a breakdown of all the mutants that could not be killed. 

Mutation testing tools cannot match the speed of code coverage tools, because the whole test suite has to be run against every mutant, so the number of test runs grows with each mutation we introduce. For complex functions, that number can be extremely large. For this reason, mutation score is probably not the best score to use - for instance - in a CI pipeline: running it several times a day or week could consume a lot of resources. However, using mutation score occasionally on a specific file or microservice can be extremely insightful.

In summary, where code coverage is mostly a quantitative measure which can be automated easily, and use case coverage is a purely qualitative approach which is impossible to automate, mutation score can be placed nicely in the middle: it provides a measure of the quality of a suite of unit tests, and at the same time it can be automated.

What can we draw from all this?

In all honesty, it would be wrong to try to pick a “winner” from the three different measurement techniques above, as each of them is particularly useful for a specific use: coverage is quick and dirty, use case coverage is pedantic but precise, mutation testing is a little random but pretty effective.

Picking use case coverage for assessing the unit test quality of a large legacy project would be silly. Similarly, expecting code coverage to give us the same depth of judgement as mutation score or use case coverage when testing a single function would also be misplaced trust.

At Ponicode, we are in the business of providing developers with a nice selection of test cases for each function they want to test. We want our AI to generate test suites that look and feel like those you would write yourself on your best possible programming day: they need to be relevant to your use case, they need to execute all of your code, they need to be complete in their functional scope, and they need to feel natural to you. And there shouldn’t be too many of them, as nobody wants a suite of 200 tests. All of the above techniques are useful to us every day in order to do just that, as we train our AI and assess the progress of our unit testing tool. I just hope that some of this knowledge can become useful to you too.

To find out more about code quality and unit testing, check out the other articles on our blog right here.
