Transformers: Beyond the Seventh Art
Are you confused between Ponicode and CircleCI? It’s not you, it’s us. Ponicode was acquired by CircleCI as of March 2022. The content and material published prior to this date remains under Ponicode’s name. When in doubt: Ponicode = CircleCI.
LSTMs were long the state-of-the-art models for sequential data such as text, until transformers took over, helped along by the BERT revolution.
Transformers owe their impressive performance to the principle of attention. Attention itself has been around for a long time: it has been used to improve the performance of LSTM networks, and it offers a degree of interpretability by highlighting the elements of the sequence responsible for a given result.
Let’s start by illustrating how a vanilla recurrent network works in the case of translation.
The prediction of the tokens is done sequentially: the token “allez” is obtained from the token “comment” and from the hidden state of the previous recurrent block.
Attention is an interface that adds information to the prediction of the word “allez”. In this case, we allow the recurrent block to fetch information from the input sequence, and use it to compute the next hidden state.
The connection between the output token and the input sequence is a simple softmax over the input tokens: the network learns weights that sum to one and is trained to keep the information from the relevant input tokens only.
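To make the softmax connection concrete, here is a minimal sketch in plain Python. It assumes that relevance is scored with a dot product between the decoder hidden state and each encoded input token, which is one common choice among several; the function names and the toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Assumption: relevance is scored with a plain dot product.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted average of the encoder states.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


# Toy example: three input tokens encoded as 2-dimensional vectors.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)  # one weight per input token
print(context)  # extra information handed to the recurrent block
```

The weights can also be read as a rough indication of which input tokens influenced the prediction, which is where the interpretability of attention comes from.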
The attention mechanism has been widely used to increase the performance of both CNN and LSTM networks. Actually, it was so powerful that researchers found out that it was sufficient for dealing with sequential data.
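The transformer architecture was first introduced in the paper Attention is all you need. It relies on the principle of self attention, where each element of a sequence attends to the other elements of the same sequence. This was a huge breakthrough: because a recurrent network processes tokens one after another, LSTM training cannot exploit GPUs as efficiently as feed-forward layers, which process the whole sequence in parallel.
We will dive deeper into the architecture of this neural network later on.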
Transformer wow effect
One of the best known transformers is OpenAI’s GPT-3 model - and for good reason. Its impressive performance shows all the power and potential of the attention mechanism. In this part of the article, we will focus on some fascinating use cases of this model.
✏️ Text generation ✏️
One of the things that impressed me the most about GPT-3 was its incredible ability to continue a text from its opening lines, in language that is grammatically near-perfect and semantically consistent enough to amaze.
An article in The Guardian, for example, was completely written by GPT-3. Fed with this introduction:
“I am not a human. I am Artificial Intelligence. Many people think I am a threat to humanity. Stephen Hawking has warned that AI could ‘spell the end of the human race.’”,
the neural network wrote a complete article which you can read here.
Candy Japan also generated a complete presentation by initialising the neural network with the phrase:
“A presentation was given at the Hacker News event in Kansai. The title of the presentation was”
After that, GPT-3 generated the presentation that was formatted in the following video:
The process is described in the article available on Candy’s blog.
🔄 Analogies 🔄
This shows that GPT-3 has a robust and consistent language model, which is in line with what we have seen in the NLP community in recent years. What is more unsettling is that this model also seems to capture logical operations. This has been explored quite extensively through analogies: we prompt the model with a list of analogies and check whether it can capture their nature.
One of the simplest examples was studied by Melanie Mitchell on her blog. She checked whether GPT-3 was able to understand analogies such as “If abc becomes abd then what would become pqr?”. GPT-3 guessed the answer “pqs”. In the article, we see that GPT-3 manages to capture analogies in a rather impressive way.
Other people have applied this same concept of analogy to more complicated use cases. For example, given examples of legal language simplification, GPT-3 is able to pick up the task and continue it on new legal texts. Similarly, given examples of sentences reformulated to be more polite, GPT-3 picks up the task and carries it out successfully.
➗ Arithmetic ✖️
All of this is unbelievable. But what intrigued us the most at Ponicode was the fact that GPT-3 succeeds (albeit moderately) in performing arithmetic operations, a task we had considered to lie outside the semantic scope of a language model. It is above all a problem of representation: numbers are treated as syntactic tokens with no order relation, so their representation must be extremely precise for arithmetic operations to be captured by the model. A blog article by John Faben explores this issue and shows the performance and limitations of GPT-3 on arithmetic problems.
In this section, we briefly introduce the base transformer architecture.
🤖 The model 🤖
A detailed implementation can be found here. The architecture chosen in the paper consists of stacking 6 encoders and 6 decoders.
Each encoder consists of:
- a self attention layer
- a normalisation layer
- feed forward layers
- a normalisation layer
The decoder structure is rather similar to the encoder, with an additional layer of encoder-decoder attention with layer normalisation.
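The encoder structure listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation: it uses a single attention head, skips the learned query, key and value projections, omits biases, and uses arbitrary toy weights. The residual connections around each sub-layer follow the original paper, even though they are not part of the list above. All function names are made up for this sketch.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-6):
    """Normalise each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def self_attention(x):
    """Single-head scaled dot-product self attention.

    Simplification: queries, keys and values are the inputs themselves,
    whereas a real layer applies learned projections first.
    """
    d = len(x[0])
    keys_t = [list(col) for col in zip(*x)]
    scores = matmul(x, keys_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, x)


def feed_forward(x, w1, w2):
    """Position-wise feed-forward block: linear, ReLU, linear."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def encoder_layer(x, w1, w2):
    x = layer_norm(add(x, self_attention(x)))           # self attention + normalisation
    return layer_norm(add(x, feed_forward(x, w1, w2)))  # feed forward + normalisation


# Toy input: 3 tokens with 4 features each; the weights are arbitrary.
tokens = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 2.0, 0.0, 1.0]]
w1 = [[0.1 * (i + j) for j in range(8)] for i in range(4)]   # 4 -> 8 features
w2 = [[0.05 * (i - j) for j in range(4)] for i in range(8)]  # 8 -> 4 features
output = encoder_layer(tokens, w1, w2)
print(len(output), len(output[0]))
```

Because the output has the same shape as the input, such layers can be stacked, which is how the paper obtains its 6 encoders.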
🦾 The performance 🦾
The transformer has been tested on various tasks in NLP. The paper showcases the results on the English to German translation.
The transformer achieves better results than the state-of-the-art methods of the time, and it is also, on average, one order of magnitude cheaper to compute.
After the advent of the transformers, one of the biggest breakthroughs in natural language processing is BERT. In this section, we detail how it works.
🤖 The model 🤖
It is a multi-purpose language model, in the sense that it extended the capabilities of the transformer from language translation to sentence classification, question answering, language inference…
BERT is a multi-layer bidirectional Transformer encoder. The base version consists of 12 encoder layers with a hidden size of 768. It was pre-trained on two tasks.
BERT is bidirectional because it was trained by masking words in the training data and predicting the missing words from the context on both sides. One could argue that it would be more accurate to call it non-directional instead. The second pre-training task consists of giving the network pairs of sentences A and B, and letting it classify whether sentence B can follow sentence A.
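The masking step of the first pre-training task can be sketched as follows. This is a simplified illustration: the BERT paper masks about 15% of the tokens and sometimes substitutes a random token or leaves the token unchanged, whereas this sketch always writes [MASK]. The function name, the seed and the example sentence are made up.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens and remember the originals as labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)  # the model must recover this word
        else:
            masked.append(token)
            labels.append(None)   # no prediction needed at this position
    return masked, labels


sentence = "the cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During pre-training, the network sees the masked sequence and is scored only on the positions that carry a label, so it has to use the words on both sides of each gap.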
Being pre-trained, BERT allows us to use transfer learning, thus applying it to different tasks by changing the last layer of the network and the tokens chosen for the input. Here are some examples extracted from the original paper.
🦾 The performance 🦾
BERT demonstrated its performance on the GLUE benchmark, where it showed a clear performance gap over previous NLP models.
BERT has become the standard for various tasks in NLP. Using powerful networks like BERT, or its variations such as RoBERTa or ALBERT, has become really easy for inference thanks to the huge work of the community.
The Python package transformers, from Hugging Face, is one of the most popular libraries in NLP.
Transformers with Python
In this part of the article, we will introduce a library that makes the use of transformers very accessible. It is the transformers library by Hugging Face. It is a Python library that abstracts the complexity of transformers in simple objects. We are going to focus on the pipeline object that allows us to perform classical NLP tasks from end to end. This object will allow you to explore the potential of transformers by yourself.
To start, you need to install the transformers library. The easiest way is to use pip:
pip install transformers
Then, create a file transform_test.py and import pipeline from transformers:
from transformers import pipeline
Masked language modeling is a well-known NLP task. It involves masking words in a text and training a model to predict them. To try such a pipeline with the transformers library, just add to our file the following lines:
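A minimal sketch of such a pipeline call could look like the block below. It assumes the default fill-mask model and an installed backend such as PyTorch; the function names and the example sentence are made up for illustration. The pipeline is created inside the function so that the model is only downloaded when it is actually needed, and an already loaded pipeline can be passed in to avoid loading it twice.

```python
def fill_mask(sentence, unmasker=None):
    """Return the model's suggestions for the <mask> token in `sentence`."""
    if unmasker is None:
        # Imported here so that nothing is loaded until it is needed;
        # the first call downloads the default model weights.
        from transformers import pipeline
        unmasker = pipeline("fill-mask")
    return unmasker(sentence)


def best_suggestion(suggestions):
    """Pick the suggestion with the highest score."""
    return max(suggestions, key=lambda s: s["score"])


# Usage, with a made-up sentence:
# suggestions = fill_mask("Transformers are the <mask> of modern NLP.")
# print(best_suggestion(suggestions)["sequence"])
```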
The token <mask> replaces the single word hidden from the sentence. The output object is the following:
We can see that the model gave us several suggestions. In each suggestion, the sequence key holds the completed sentence, the score key gives the probability the model assigns to the proposed word, and the token key gives the index of that word in the vocabulary of the tokenizer used.
Moreover, in this classical pipeline, the model used is RobertaForMaskedLM and the tokenizer is RobertaTokenizer. It is possible to change them by initialising the pipeline object with the model and tokenizer parameters, but this will probably be the subject of a future article.
Text summarization is one of the most important tasks in NLP. It is a matter of transforming a long text into a short text by keeping only the key information. This can be done in two ways: either by extracting the key sentences (extractive summarization) or by generating a summary from scratch (abstractive summarization). The transformer models used here are generative, not extractive.
Now let’s test this task with the transformers library. Just add the following lines to our file:
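A minimal sketch of the call, assuming the default summarization model and an installed backend such as PyTorch. The function name and the default lengths are made up for illustration; max_length and min_length are passed straight to the pipeline.

```python
def summarize(text, summarizer=None, max_length=50, min_length=10):
    """Return a generated summary of `text`."""
    if summarizer is None:
        # The first call downloads the default summarization model.
        from transformers import pipeline
        summarizer = pipeline("summarization")
    outputs = summarizer(text, max_length=max_length, min_length=min_length)
    # The pipeline returns a list with one dictionary per input text.
    return outputs[0]["summary_text"]


# Usage, with any long text of your choice:
# long_text = open("article.txt").read()  # hypothetical file
# print(summarize(long_text, max_length=50))
```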
The max_length parameter sets the maximum length of the summary, measured in tokens. There is also a min_length parameter. We get the following output object:
By setting max_length to 20, we get:
The default summarization pipeline uses the BART tokenizer and its model. This can be changed using the model and tokenizer parameters.
Text generation is also a hot topic. It is a matter of generating the continuation from the beginning of a sentence or text. To do this, add the following lines to our file:
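A minimal sketch of the call, assuming the default text-generation model and an installed backend such as PyTorch. The function name and the example prompt are made up for illustration.

```python
def generate(prompt, generator=None, max_length=100):
    """Return `prompt` followed by a generated continuation."""
    if generator is None:
        # The first call downloads the default text-generation model.
        from transformers import pipeline
        generator = pipeline("text-generation")
    outputs = generator(prompt, max_length=max_length)
    # The pipeline returns a list with one dictionary per generated sequence.
    return outputs[0]["generated_text"]


# Usage, with a made-up prompt:
# print(generate("Ponicode is", max_length=100))
```

Generation is sampled by default in many setups, so two runs with the same prompt may produce different texts.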
The max_length parameter caps the generated text at 100 tokens, prompt included. Thus, we get the following generated text:
By setting max_length to 50, we obtain:
As you can see, transformers love Ponicode! And that love is reciprocal.
At Ponicode, we are very aware of the potential of transformers and do a lot of research on the subject. Moreover, we follow publications that deal with this subject very closely. We apply this concept of attention mechanisms to code-type data in order to solve problems related to the software development process. If you are interested in the subject of machine learning applied to code, read our article on the subject.