AI on Code: What's New in 2022?
Are you confused between Ponicode and CircleCI? It’s not you, it’s us: Ponicode was acquired by CircleCI in March 2022. Content and material published before that date remains under the Ponicode name. When in doubt: Ponicode = CircleCI.
Here at CircleCI France, we’re excited about all things code and all things AI, so we keep a close eye on the hottest topics in Code Intelligence that can improve developers’ coding experience (as readers of our piece 6 game-changing AI-on-code papers published in 2021 that you should know about already know). So far, 2022 has seen a surge of creative advances in the field, along with major trendsetters you should definitely keep close tabs on.
AI Taking On Coding Competitions
Thanks to the leap in both scale and complexity of natural language processing (NLP) models, more and more impressive challenges in the field of code generation have been overcome lately. One that has remained elusive, however, is competitive programming, which requires robust problem-solving skills and algorithmic knowledge beyond simply writing correct code.
DeepMind’s latest creation, AlphaCode, is a competitive coding AI that leverages an encoder-decoder transformer architecture, pre-trained on open source code, to solve competitive programming challenges from natural language prompts. (These are the folks behind the famous AlphaGo, the first machine to beat a professional Go player, by the way.) The current version placed in the top 54.3% on average in sample competitions with more than 5,000 participants. Impressive 🤯
The Year Of The Transformer
AlphaCode isn’t the only recent model to make use of a transformer architecture. Domain experts agree: when it comes to NLP, 2022 is definitely the year of the transformer. OpenAI’s GPT-3 (short for Generative Pre-trained Transformer) is arguably the world’s most powerful text generation tool right now, while Google’s ever-so-popular BERT and its countless variants continue to dominate the leaderboards of state-of-the-art benchmarks such as GLUE and SuperGLUE. (In fact, a code-oriented counterpart called CodeXGLUE exists, where BERT variants are setting records as well.)
When it comes to AI on code, models based on the T5 pre-trained architecture have been faring well, winning various NL-PL, PL-NL, and PL-PL tasks (NL for Natural Language, PL for Programming Language) such as code summarization, defect and clone detection, code generation, and debugging. CoTexT and CodeT5 are two such performers. Our favorite is TFix, a new code correction solution that feeds context and compiler error messages into the model to reach state-of-the-art bug-fixing performance.
If you need a brush-up on transformers, no worries. Our blog post on the topic is just what you need 💪
Code Intelligence Going International
Speaking of benchmarks: the Code/Natural Language challenge (CoNaLa for short) is a dataset that has been popular for some time now for testing code snippet generation from natural language. It is composed of scraped StackOverflow questions paired with their code snippet solutions, hand-curated for quality.
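To make this concrete, here is a sketch of what a CoNaLa-style record looks like. The field names follow the published dataset; the values here are invented for illustration:

```python
# A CoNaLa-style entry pairs a natural language intent (a StackOverflow
# question title, optionally rewritten by curators to be more precise)
# with the Python snippet that answers it. Values are made up.
example = {
    "question_id": 123456,                      # StackOverflow question ID
    "intent": "sort a list of dicts by a key",  # raw question title
    "rewritten_intent": "sort list `items` by the key 'age'",
    "snippet": "sorted(items, key=lambda d: d['age'])",
}

# A generation model is trained to map intent -> snippet,
# preferring the curated rewrite when one exists:
source = example["rewritten_intent"] or example["intent"]
target = example["snippet"]
```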
MCoNaLa (the M stands for multiple) is its smaller sibling, developed by researchers this year in an effort to make NLP and code AI more universal and to address the lack of non-English resources. It applies the same principle of processing StackOverflow data with the help of experts to Chinese, Spanish, and Russian forums, and goes further by using translation in the preprocessing of data and/or outputs to augment the model with the help of the original CoNaLa dataset.
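The augmentation idea can be sketched abstractly: translate a non-English intent into English so that the model can also learn from the original English CoNaLa data. The `translate` function below is a hypothetical stand-in, not the paper’s actual machine-translation system:

```python
def translate(text: str, src: str, dst: str = "en") -> str:
    """Hypothetical stand-in for a machine-translation system."""
    lookup = {("es", "ordenar una lista"): "sort a list"}
    return lookup.get((src, text), text)

# A Spanish intent/snippet pair, as MCoNaLa collects from Spanish forums
# (values invented for illustration):
spanish_example = {"intent": "ordenar una lista", "snippet": "sorted(xs)"}

# Translate the intent to English; the snippet is language-agnostic,
# so the augmented pair can be pooled with English CoNaLa data.
augmented = {
    "intent": translate(spanish_example["intent"], src="es"),
    "snippet": spanish_example["snippet"],
}
```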
While the results fell somewhat short of the authors’ hopes, the paper highlights the challenges of applying language models to multilingual code problems, and the importance of such models for taking Code Intelligence beyond its current restriction to English, especially given the abundance of code resources available in other languages.
Grammar Is The New Syntax
Last year, we were all about Abstract Syntax Trees (ASTs) and presented work that parsed code to obtain better representations to fuel NLP architectures. Indeed, since programming languages are far more syntax-sensitive than natural language, relying on formal representations provides just the boost needed for models to capture code structure.
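As a quick illustration of what this formal structure looks like, Python’s built-in `ast` module parses a snippet into a syntax tree whose nodes a model (or feature extractor) can walk, instead of consuming raw tokens:

```python
import ast

code = "def add(a, b):\n    return a + b"
tree = ast.parse(code)

# Walk the tree breadth-first and collect node types -- this is the kind
# of structural signal tree-aware models consume instead of flat text.
node_types = [type(node).__name__ for node in ast.walk(tree)]
```

The resulting list starts with `Module` and `FunctionDef` and descends into the arguments and the return expression, making the nesting of the code explicit.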
In that vein, we found the approach of “Analysis of Tree-Structured Architectures for Code Generation” very smart: it studies the use of natural language parse trees, built with linguistic rules to capture the hierarchical structure of text inputs, to complement baseline code generation models. The authors show improved consistency over two code generation benchmarks using tree-to-seq (language tree to code) and tree-to-tree (language tree to AST) models.
An example of a generated tree using language parsing rules
The most elegant approach we’ve seen so far, however, is the one adopted by the team behind “The impact of lexical and grammatical processing on generating code from natural language”. They leverage language parsing rules and techniques to generate AST outputs from a seq2seq model, which are then transformed into Python code.
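That final step, turning a predicted tree back into Python source, has an analogue in the standard library: `ast.unparse` (Python 3.9+) serializes an AST into code. A minimal sketch, with the tree hand-built here to stand in for a model’s output:

```python
import ast

# Hand-build a tiny AST for `print(40 + 2)` -- in the paper's pipeline
# this tree would come from the seq2seq model, not be written by hand.
tree = ast.Module(
    body=[
        ast.Expr(
            value=ast.Call(
                func=ast.Name(id="print", ctx=ast.Load()),
                args=[
                    ast.BinOp(
                        left=ast.Constant(value=40),
                        op=ast.Add(),
                        right=ast.Constant(value=2),
                    )
                ],
                keywords=[],
            )
        )
    ],
    type_ignores=[],
)
ast.fix_missing_locations(tree)  # fill in line/column info so the tree is valid
generated = ast.unparse(tree)    # serialize the tree back to Python source
```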
That’s it for 2022! We hope this gives you a global perspective on innovations in AI applied to Code Intelligence, and that you are as thrilled as we are to follow the evolution of this field.
As developers’ partners, we believe in augmenting their capabilities with AI, and this is what we aim to achieve through our solutions. Don’t hesitate to give them a shot! They are data scientist and ML engineer friendly too 😉
📚 Some resources you might find useful:
- One of our latest AI Meetup was on Transformer Models & Architectures, watch the replay 🎥 (French speakers only 🇫🇷)
- Code quality is not just for devs; we dug into the topic with this article: When data scientists should perform tests