Machine Learning on Source Code
Software has become essential to our lives. Millions of lines of code are written and executed every day all around the world, and the economy as a whole relies increasingly on IT. Developers are therefore more responsible than ever for their impact on society. Producing high-quality software under current time constraints makes development tools a key factor in the success of the tech industry. Many of these tools grew out of older academic research which, for most of our industry’s existence, has been built around a deductive-logic approach. We have recently witnessed a shift toward statistics-based research. The purpose of this article is to summarize, in a non-exhaustive way, the different aspects driving this trend.
Why AI on code?
Today, with the widespread adoption of open-source software, everyone has access not only to an astronomical quantity of code, but also to metadata related to the programming profession: comments, bug fixes, reviews, timelines, etc. All this data, particularly when coupled with the rise of machine learning, is a tremendous opportunity for new tools.
Beyond the sheer amount of data, code is not only a means of getting a computer to perform tasks; it also has a communicative function. As developers’ roles have become increasingly collaborative, more and more code must be readable by people other than its author (or by the author some time after writing it). Developers are therefore required to respect naming and readability constraints. All this makes code look more like natural language than ever before, at least in one of its dimensions. And statistical methods on natural language have now proved their worth on many tasks that cannot be automated by a programmatic approach.
That said, even as it increasingly fulfills a communicative function between developers, code is still essentially written to be read and executed by a computer. Moreover, a programming language is created and influenced by a very limited number of people, unlike human language, which is the result of thousands of years of human interaction. A programming language is constructed in a mathematical way and is therefore extremely formal, which is why its syntax is so constrained compared to human language. With almost unlimited data and such a constrained syntax, the problem of statistically modeling code is intuitively much less complex (in terms of dimensionality) than that of Natural Language Processing (NLP).
Source Code vs. Natural Language
A 2018 publication draws a parallel between ML on code and NLP by dividing the research field as follows:
– Code-generating models, which aim at generating code. They can be seen as models that generate text such as autocomplete, automatic translators, chatbots, etc.
– Code representation models, which allow one to represent specific properties of code in order to characterize it. These models can be related to sentiment analysis, named-entity recognition (NER) or text classification.
– Pattern-mining models, whose goal is to find—in an unsupervised way—repetitive and user-interpretable structures in code. They can be compared to topic modeling or structured information mining in text.
In each of these three categories, we will list academic articles that we think are interesting to share. We have decided to only share publications that are freely and openly available on arXiv.
Code-generating models
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Since developers can name identifiers as they wish, the potential vocabulary of code is theoretically infinite. This publication builds an open-vocabulary language model by treating each token as a sequence of subtokens. The authors also evaluate their model on several tasks and report that it often surpasses the state of the art.
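To make the subtoken idea concrete, here is a minimal Python sketch. It is not the paper's method: it assumes a hand-written split on underscores and case changes as a stand-in for subword units learned from data, and the function names and the small vocabulary are hypothetical.

```python
import re

def split_identifier(name):
    """Split an identifier into lowercase subtokens on underscores and case changes."""
    subtokens = []
    for part in name.split("_"):
        # Alternatives: an acronym run (HTTP), a capitalized or lowercase word, a digit run.
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [s.lower() for s in subtokens]

def encode(name, vocabulary):
    """Map an identifier to known subtokens, using <unk> for unknown pieces."""
    return [s if s in vocabulary else "<unk>" for s in split_identifier(name)]

# Hypothetical subtoken vocabulary, as if collected from a training corpus.
vocabulary = {"parse", "get", "http", "response", "count", "max"}

print(split_identifier("parseHTTPResponse"))
print(encode("getHttpResponse", vocabulary))  # an identifier never seen as a whole
```

The point of the sketch is that an identifier never seen as a whole can still be expressed with a small, closed set of subtokens, so the model's vocabulary stays bounded even though the space of names does not.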
Unsupervised Translation of Programming Languages 
This paper from Facebook attracted a lot of attention. It presents a translator between programming languages that is trained in an unsupervised way. On the function-translation task, evaluated using unit tests, the model significantly outperforms current commercial solutions.
code2seq: Generating Sequences from Structured Representations of Code 
Based on paths in an AST (abstract syntax tree) and an attention mechanism, this publication proposes a model whose task is to summarize source code. An online demo is available at: https://code2seq.org.
Code representation models
code2vec: Learning Distributed Representations of Code 
code2vec is a way to represent source code using paths inside an AST. When these paths are used as context to learn embeddings, the method gives results 75% better than the previous state of the art in function-name prediction. An online demonstration on different tasks is available at: https://code2vec.org.
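The notion of a path between two leaves of an AST can be sketched with Python's built-in ast module. This is only an illustration of path extraction under several assumptions: it works on Python code, it treats names, arguments and constants as leaves, and the "^" (up) and "_" (down) notation is an arbitrary choice. The embedding and learning steps are not shown.

```python
import ast
from itertools import combinations

def collect_leaves(tree):
    """Return (token, root-to-leaf node list) for every name, argument and constant."""
    leaves = []

    def visit(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, trail))
        elif isinstance(node, ast.arg):
            leaves.append((node.arg, trail))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), trail))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, trail)

    visit(tree, [])
    return leaves

def path_contexts(source):
    """Return (token_a, path, token_b) triples for every pair of leaves."""
    contexts = []
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(collect_leaves(ast.parse(source)), 2):
        # Count the ancestors shared by both leaves; the last one is their
        # lowest common ancestor.
        shared = 0
        while trail_a[shared] is trail_b[shared]:
            shared += 1
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = type(trail_a[shared - 1]).__name__
        down = [type(n).__name__ for n in trail_b[shared:]]
        contexts.append((tok_a, "^".join(up + [top]) + "_" + "_".join(down), tok_b))
    return contexts

for context in path_contexts("def add(a, b):\n    return a + b\n"):
    print(context)
```

In this example, the two operands of `a + b` are connected by the path `Name^BinOp_Name`: up from one name to the binary operation, then down to the other name.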
Learning to Represent Programs with Graphs 
Beyond an abstract syntax tree, this publication proposes a more complete way to represent code based on a graph. This method enriches an AST with edges representing relationships between different tokens (Child for the classical relationship in an AST, ReturnsTo to link the return to the function declaration, LastLexicalUse to link different uses of the same identifier, etc.). This representation is then used to name variables and to detect variable misuse.
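As an illustration of the three edge types named above, here is a small sketch that builds such a graph with Python's ast module. The use of Python, the function name `build_program_graph` and the triple format are assumptions of this sketch, and it deliberately ignores scoping: two identifiers with the same name are treated as the same variable.

```python
import ast
from collections import Counter

def build_program_graph(source):
    """Return a list of (edge_type, source_node, target_node) triples."""
    tree = ast.parse(source)
    edges = []

    def visit(node, enclosing_function):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            enclosing_function = node
        elif isinstance(node, ast.Return) and enclosing_function is not None:
            # ReturnsTo: link a return statement to its function declaration.
            edges.append(("ReturnsTo", node, enclosing_function))
        for child in ast.iter_child_nodes(node):
            # Child: the ordinary parent-to-child relationship of the AST.
            edges.append(("Child", node, child))
            visit(child, enclosing_function)

    visit(tree, None)

    # LastLexicalUse: link each identifier to its previous occurrence in the text.
    names = sorted(
        (n for n in ast.walk(tree) if isinstance(n, ast.Name)),
        key=lambda n: (n.lineno, n.col_offset),
    )
    last_seen = {}
    for name in names:
        if name.id in last_seen:
            edges.append(("LastLexicalUse", name, last_seen[name.id]))
        last_seen[name.id] = name
    return edges

example = "def clamp(x, low):\n    if x < low:\n        x = low\n    return x\n"
print(Counter(edge_type for edge_type, _, _ in build_program_graph(example)))
```

A graph neural network would then consume these typed edges; that part is outside the scope of this sketch.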
Contrastive Code Representation Learning 
This publication presents a way of representing code that is independent of any particular task. The representation is learned in an unsupervised way, so it can feed any model for any task on code. The results it yields on the code-summarization task are superior to the state of the art. The same goes for type prediction.
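The general idea behind a contrastive objective can be sketched as follows: the embedding of a program should be closer to the embedding of an equivalent variant of itself than to the embeddings of unrelated programs. The code below is a generic InfoNCE-style loss in plain Python, not the paper's exact objective or encoder. The vectors are made up for the example and the temperature value is an arbitrary assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, negative) / temperature for negative in negatives]
    # Numerically stable log-sum-exp over all candidates.
    highest = max(logits)
    log_sum = highest + math.log(sum(math.exp(l - highest) for l in logits))
    return log_sum - logits[0]

# Made-up embeddings: a program, a rewritten but equivalent variant, two unrelated programs.
program = [1.0, 0.0]
variant = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.0]]

print(info_nce(program, variant, unrelated))       # low: the variant is close to the program
print(info_nce(program, unrelated[0], [variant]))  # high: the "positive" is far away
```

Minimizing this loss over many programs pushes an encoder toward representations that depend on what the code does rather than on how it is written, which is what makes them reusable across tasks.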
Pattern-mining models
Mining Idioms from Source Code
This paper presents Haggis, a system for mining idioms in source code. An idiom is a succession of lines of code (syntax) that is often repeated and always has the same purpose (semantics). The system is based on nonparametric Bayesian probabilistic tree substitution grammars. The model is evaluated on a large amount of open-source code. The results show that many of the detected idioms also appear on Q&A websites for programmers, and that the most frequent idioms correspond to real concepts in code, such as initializing a class instance or catching an error.
Parameter-Free Probabilistic API Mining across GitHub 
This publication presents PAM (Probabilistic API Miner), a model for detecting the most interesting API call patterns. The model is then used to check whether the most likely API call sequences are well documented (spoiler alert: the answer is no). This tool can therefore be useful for documentation, among other things.
Topic modeling of public repositories at scale using names in source code 
This paper applies topic modeling to a corpus of 18M open GitHub repositories in order to derive the most recurrent topics and tokens that characterize them. The publication lists a number of topics such as scientific and social concepts, languages, programming languages, computer science, technologies, games, etc.
Ultimately, there is a real trend toward the application of statistical methods to source code in the academic world. Many researchers around the world are working on this: considering code as data, creating new tools for developers or improving old ones. Statistics allow us to solve complex problems that are in practice unsolvable by the traditional deductive-logic approach. This trend has only just begun, and we bet that there is no AI winter ahead anytime soon for this important work, given all the challenges it can address today and in the near future. Ponicode is part of this movement and intends to take advantage of the mathematical and technical revolution of AI to transform developers’ daily lives, and the lives of those they touch.