Who wrote my unit tests?

Are you confused between Ponicode and CircleCI? It’s not you, it’s us. Ponicode was acquired by CircleCI as of March 2022. The content and material published prior to this date remains under Ponicode’s name. When in doubt: Ponicode = CircleCI.

A few weeks ago, I decided to introduce the Ponicode tool to a mathematician friend of mine who had a lot of preconceived ideas about how to solve the problem of unit test generation. She was very surprised to see that, for a function checking the syntactic validity of an email address, Ponicode suggested “user@ponicode.com”, “user1+user2@ponicode.com”, “user@ponicode”, “user@ponicode.co.uk”, and “user.ponicode.com” as possible input values. “I thought the suggestions would look way, way more machine-generated!” she said. She even went on to question whether the product really used artificial intelligence, because such suggestions seemed impossible under the way she had modelled the problem in her head. Her doubt that this was “artificial intelligence” is what inspired me to write this article on one of the foundational philosophical concepts of AI: the Turing test.
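To make the example concrete, here is a minimal sketch of what such a validator might look like, exercised with the suggested inputs. The function name and the regex are our illustration for this article, not Ponicode's actual code:

```python
import re

# A hypothetical email validator of the kind described above.
# The regex is deliberately rough: local part, '@', then a dotted domain.
def is_valid_email(address: str) -> bool:
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", address) is not None

# The suggested inputs exercise both valid and invalid syntax:
assert is_valid_email("user@ponicode.com")
assert is_valid_email("user1+user2@ponicode.com")
assert is_valid_email("user@ponicode.co.uk")
assert not is_valid_email("user@ponicode")      # no dot in the domain
assert not is_valid_email("user.ponicode.com")  # no '@' at all
```

Notice that the suggested inputs probe exactly the boundaries of the syntax: a plus-tagged address, a multi-part top-level domain, a missing dot, a missing “@”.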

What is thought?

The question of the possibility of human intelligence emerging from inanimate matter has always been central to the history of thought.

Let’s take the risk of oversimplifying things and attracting the wrath of our philosopher friends by saying that since ancient Indian philosophy (around 600 B.C.)[1], there have always been two opposing camps (with nuances in between):

The dualists, for whom thought is of a non-physical nature and therefore can neither be explained nor reproduced through matter.

The materialists, for whom thought is a physical phenomenon and can be fully explained, or even reproduced, thanks to physical principles.

René Descartes’s illustration of dualism. Inputs are passed on by the sensory organs to the epiphysis (pineal gland) in the brain and from there to the immaterial spirit.

What about Alan Turing?

Alan Turing, who had already been working on intelligent machinery since 1941[2], officially asked the question “Can machines think?” in his publication “Computing machinery and intelligence” in 1950[3]. He decided to take a new approach to answering this question by transforming the basic question into “Can machines do what we (as thinking entities) can do?”[4]. By rephrasing the question in this way, he wondered above all what type of scientific experiment could answer the question of machine thinking.

Alan Turing

Can machines think?

Turing first set a level of intelligence that the machine should aim to achieve. Of course he picked something easy, setting human intelligence as the standard. A simple way to gauge someone’s intelligence is to have a conversation with them, so Turing imagined a test in which a human interviewer holds a conversation with either a human or a machine, without knowing which. If the interviewer speaks with the machine but believes it is human, the machine passes. In other words, a machine passes the Turing test if it succeeds in appearing human at least as often as human subjects do.

Turing test diagram

Is that enough?

It is legitimate to ask whether a machine that passes this test can be considered intelligent once and for all. John Searle, the American philosopher, was not convinced and proposed a counterargument: a thought experiment called “The Chinese Room”[5], in which an individual with no knowledge of Chinese is locked in a room and given a book of rules for answering sentences in Chinese. This individual can then answer questions without actually “understanding” a word of Chinese. Searle claimed that a machine passing the Turing test would behave like this individual, purely syntactically (manipulating symbols), unlike a real Chinese speaker who is capable of semantic manipulation (handling the meaning of words).

However, this experiment suffers from a major limitation: the book given to the test subject would need to be almost infinitely large to contain answers to all possible sentences in Chinese. At that point, the ability to extract the appropriate answer from an almost infinite book of rules could defensibly be considered intelligence. The fact remains that a simple combination of symbols is enough to pass the Turing test. Some see this as proof that the Turing test cannot decide on intelligence; others conclude instead that semantics is precisely the ability to manipulate linguistic symbols in all their complexity. One can indeed defend the idea that what we call “semantics” is merely an emergent phenomenon of the syntactic processing of electrical signals by the neurons in our brain.

Despite this controversy, the Turing test remains a standard in artificial intelligence. It has had vast influence on many disciplines such as philosophy, psychology, linguistics, neuroscience and so on. As of today, despite some very interesting attempts[6], the Turing test, in the sense Turing intended, has not yet been passed[7].

The Chinese Room as an input-output system

What does this have to do with unit testing?

To understand the relationship with unit tests, let’s look at part of the Wikipedia definition: “Unit tests are typically automated tests written and run by software developers to ensure that a section of an application (known as the ‘unit’) meets its design and behaves as intended”[9]. An application, in its natural environment, functions in interaction with humans. However, as explained in our blog post on code coverage[11], the quality of a unit test is often measured by its coverage, in other words the proportion of lines of code executed during the test. This metric, although very useful, remains insufficient. Ideally, a test suite should exhaustively cover the complete set of scenarios that the code under test might encounter in its natural functioning. For this reason, we believe that the quality of a unit test generator can be measured using a Turing test, where the generated test cases would be indistinguishable from the cases encountered in the production life of the program. According to our users, our unit test generator passes the Turing test.
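A tiny example of why coverage alone is insufficient. The function and the bug are invented for illustration:

```python
# A hypothetical business rule with an off-by-one bug:
# free shipping was meant to apply FROM 50 upwards (>=), not above 50.
def applies_free_shipping(total: float) -> bool:
    return total > 50

# These two tests pass, and every line of the unit is executed
# (100% line coverage), yet the boundary value 50 -- the very case
# most likely to matter in production -- is never tested, so the
# bug survives the test suite.
assert applies_free_shipping(60.0) is True
assert applies_free_shipping(10.0) is False
```

A test suite drawn from realistic production scenarios would almost certainly include an order totalling exactly 50, and would catch the bug that full line coverage missed.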

A conversation with Blender, Facebook’s chatbot[6]

Ponicode’s opposing views

Let’s start by formalising (and demystifying) a bit how machine learning works. Broadly speaking, there are three parts: data, an objective function and an optimisation method. Put simply, we define a quantity (calculated from the data) to be optimised (maximised or minimised) using an optimisation method, and the result is what we call a model. Notice that the theoretical heart of the problem is the definition of the objective function to be optimised. As for the data and the optimisation method, let’s be crazy and say that these are practical details.
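The three parts above fit in a few lines of code. In this sketch the data, the model (y ≈ w·x) and the learning rate are all made up for illustration; the objective is a mean squared error minimised by plain gradient descent:

```python
# Part 1: data -- a handful of (x, y) pairs, invented for the example.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Part 2: the objective function -- mean squared error of y ≈ w * x,
# the quantity we want to minimise.
def objective(w: float) -> float:
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Part 3: the optimisation method -- plain gradient descent on w.
w, lr = 0.0, 0.05
for _ in range(200):
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

# w converges towards the least-squares slope (about 2 for this data),
# and that final value of w is what we call the "model".
```

The point of the sketch is the division of labour: everything interesting happens in `objective`; the loop and the data are the “practical details”.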

When we studied the state of the art in automatic generation of unit tests, we realised that the objective functions were of two types: coverage score or mutation score. Coverage is, as explained above and in our blog post[11], the proportion of lines executed during testing. The mutation score is a bit more interesting; let’s take a short detour to explain it.

We assume that developers are competent (we believe in you!) and that the errors they make are purely syntactic in nature: a “-” instead of a “+”, a “>” instead of a “>=”, and so on. We then take the unit under test and create several “mutants”: distorted versions of the unit in which one operator is replaced by another, a line is deleted, or any other transformation simulating a syntactic slip by the developer. We then run our tests and measure the proportion of mutants on which the tests fail (we say those mutants were killed). This proportion is called the mutation score, and the idea behind it is that good tests succeed in detecting syntactic errors in the code.

At Ponicode, after studying the habits of developers writing tests, we found these two objective functions very interesting but were not convinced that they are sufficient. One only has to compare the unit tests generated by existing tools with tests written by developers to see that the main difference lies in their naturalness: developers try to write tests that look like production data, while current tools are completely blind to this. We therefore had to invent our own objective functions, which we will not disclose in this article (see the image below). In addition to optimising coverage and mutation score, we also optimise naturalness metrics inspired by the Turing test.

Hint: we took our inspiration from this meme[12] #joke

Make Turing Test Great Again

We at Ponicode believe that the mathematics and computer science literature is full of interesting ideas, many of which are today (wrongly) dismissed as theoretical concepts with no practical use. We strongly believe that taking a fresh look at these concepts helps us, and will continue to help us, create life-changing tools for developers!


[1] Chatterjee, Amita. “Naturalism in classical Indian philosophy.” (2012).
[2] Moor, James, ed. The Turing test: the elusive standard of artificial intelligence. Vol. 30. Springer Science & Business Media, 2003.
[3] Turing, Alan M. “Computing machinery and intelligence.” Parsing the Turing Test. Springer, Dordrecht, 2009. 23-65.
[4] Harnad, Stevan. “The annotation game: On Turing (1950) on computing, machinery, and intelligence.” The Turing test sourcebook: philosophical and methodological issues in the quest for the thinking computer. Kluwer, 2006.
[5] Searle, John R. “Minds, brains, and programs.” Behavioral and brain sciences 3.3 (1980): 417-424.
[6] Roller, Stephen, et al. “Recipes for building an open-domain chatbot.” arXiv preprint arXiv:2004.13637 (2020).
[7] Beaufils, B. (n.d.). Non, le test de Turing n’est pas passé ! [No, the Turing test has not been passed!] Retrieved June 02, 2020, from Scilogs – Le test de Turing
[8] Le test de Turing | Intelligence Artificielle 3 (ft … (n.d.). Retrieved June 2, 2020, from Youtube (Science4All) – Le test de Turing
[9] Unit testing. (2020, May 11). Retrieved June 02, 2020, from Wikipedia – Unit testing
[11] Aouad, E. (2020, May 22). Cover rage, when code quality matters. Retrieved June 02, 2020, from Ponicode Blog – Cover rage, when code quality matters
[12] R/ProgrammerHumor – Different ways of making an AI chatbot. (n.d.). Retrieved June 03, 2020, from Reddit – Different ways of making an AI chatbot
