Our Rap Lyrics Generator for Github Satellite 2020: A brief AI tutorial
AI text generation made easy for anyone
“AI is so dangerous!”, “Robots will soon conquer the world!”, “Have you heard about GPT-2? Too dangerous for the public.”
Warning: Today I’d like to shock you to the core with the latest AI weapon and send shivers down your spine by, wait for it, generating RAP Lyrics!
We debuted this AI last week at Github Satellite in a Rap Battle for the history books, putting the rap lyrics generator Ponybot against a real human rapper. In case you missed it, you can watch the rap battle here: https://blog.ponicode.com/2020/05/11/watch-the-ponicode-ai-rap-battle-at-github-satellite-2020/
You can train your own AI rapper online by following the Colab Link
AI may be a cause for optimism but also for concern as people develop new use cases. As you may have heard, a “dangerous” AI weapon of propaganda, “able to generate fake news” was released by Open AI in February. This AI is able to generate highly realistic texts (papers, music, books, …). Its name is GPT-2.
Some are afraid of it. Others feel like they have a superpower when it’s in their hands.
In this tutorial, I’d like you to play Superman with me and use GPT-2 to rap like my favourite French rapper: Booba.
Spoiler alert in case you were wondering: Larry, the amazing human who went up against the Ponybot, won the rap battle at Github Satellite, but we are totally on-board with that. After all, here at Ponicode we believe that the best AI will unleash the creativity and human ingenuity of developers.
In this article I’ll guide you step by step to train your first GPT-2 machine gun on Google Colab. I’ll write another article to show you how to train on an AWS EC2 and backup the model on S3.
If you don’t have any coding experience, or you don’t want to spend time collecting Booba’s lyrics but still want to test the GPT-2 model, I advise you to use Shakespeare’s texts instead, since I can’t give you Booba’s lyrics. You can download Shakespeare’s texts here: Github (Karpathy) – Shakespeare’s text
Copy the text, save it into a file and name it Shakespeare.txt. Credit goes to https://github.com/karpathy/char-rnn for the Shakespeare data.
A- Data collecting
- Where to get the data?
- How to collect it?
B- Data processing
- French rap subtleties
- Text processing
C- Train on Google Colab
- Create your first Colab notebook
- Upload your file
A- Data collecting
1. Where to get the data?
Since I have a lot of Booba’s albums, I managed to put his lyrics into a Booba_lyrics.txt file that I will not share for obvious legal reasons.
2. How to collect the data?
You can use websites’ APIs, together with well-known Python libraries like requests, BeautifulSoup and so on.
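In practice you often end up parsing the page HTML yourself. Here is a minimal, dependency-free sketch using Python’s standard-library `html.parser`; the `<div class="lyrics">` container and the page snippet are hypothetical — every lyrics site uses its own markup (and you should check its terms of service before scraping):

```python
from html.parser import HTMLParser

class LyricsExtractor(HTMLParser):
    """Collects the text inside a hypothetical <div class="lyrics"> container."""
    def __init__(self):
        super().__init__()
        self.in_lyrics = False
        self.depth = 0       # nested <div>s inside the lyrics block
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if self.in_lyrics:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and ("class", "lyrics") in attrs:
            self.in_lyrics = True

    def handle_endtag(self, tag):
        if self.in_lyrics and tag == "div":
            if self.depth == 0:
                self.in_lyrics = False
            else:
                self.depth -= 1

    def handle_data(self, data):
        if self.in_lyrics and data.strip():
            self.lines.append(data.strip())

# Hypothetical page markup, for illustration only.
page = '<html><body><div class="lyrics">Tout pour la monnaie<br>Billets verts</div></body></html>'
parser = LyricsExtractor()
parser.feed(page)
print(parser.lines)  # → ['Tout pour la monnaie', 'Billets verts']
```

With a real site you would fetch the page first (for example with requests) and feed the response body to the parser.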
B- Data processing
1. French rap subtleties
French rap, as in any language, is not the easiest kind of text. Take these sentences for example:
“J’préfère mourir à se-l’ai que vivre en galérien”. (Paradis, Booba)
“Therapyzi neuf-deux Izi” (Paname, Booba)
“Cent moins ocho dans le building” (Paname, Booba)
“Tout, tout pour la monnaie brrr, brrr, billets verts” (Billets Verts, Booba)
“Tu dis partout qu’t’es une terreur, chez nous on t’trouve super sympa” (Double Poney, Booba)
The way French rap songs are structured compared to the pretraining of GPT-2 adds complexity:
- Obviously, French rap is in French, and GPT-2 has mostly been pre-trained on English corpora. This English DNA of GPT-2 will leave traces on the predictions.
- French rap uses verlan. This form of slang, popular among young people, involves inverting the syllables of words.
- In some lines, Booba uses his favourite words following a rhythmic logic rather than a language logic (“Therapyzi neuf-deux Izi”, “brrr, brrr”).
- Booba mixes languages within his sentences (French, Spanish, English, Arabic, …).
- Contractions are used heavily (“on t’trouve”, “J’préfère”, “qu’t’es une terreur”).
- Rap is structured by rhymes, tempo, syllables, rhythm and unconventional grammar rules.
Rap grammar has fewer rules than natural-language grammar. These subtleties will reduce the quality of the predictions.
2. Text processing
- Remove other rappers’ lyrics: Booba did a lot of featurings with other rappers. To stick to Booba’s style, I removed all the other rappers’ lyrics from his songs.
- Add “starts” and “ends” between songs: every song is different and has its own style. I added ‘<|startoftext|>’ and ‘<|endoftext|>’ at the beginning and end of each song so the AI can switch styles while rapping.
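As a sketch, assuming the lyrics are grouped into one string per song (the songs list below is illustrative):

```python
# Illustrative stand-ins for full songs.
songs = [
    "Tout, tout pour la monnaie, billets verts",
    "Aucune attache, grosse bague en poche",
]

# Wrap each song in the delimiters so the model learns where one
# song ends and the next begins.
corpus = "\n".join(f"<|startoftext|>\n{song}\n<|endoftext|>" for song in songs)
print(corpus.splitlines()[0])  # → <|startoftext|>
```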
- Add a space before commas and after apostrophes:
“Tu dis partout qu’t’es une terreur” becomes “Tu dis partout qu’ t’ es une terreur”
“Tout, tout pour la monnaie brrr,brrr, billets verts”
becomes “Tout , tout pour la monnaie brrr , brrr , billets verts”
import re
lyrics = [re.sub(r"\s*,\s*", " , ", sentence.replace("'", "' ").replace("’", "’ ")) for sentence in lyrics]
- Divide by syllables: to shape your data so that it represents the concept of rap, you can structure it by rhymes, tempo, syllables, rhythm, and so on. I have just divided by syllables for this example; you can dig further. You can do this using pyphen (it is not perfect, but it works well enough):
import pyphen
dic = pyphen.Pyphen(lang='fr')
lyrics = [dic.inserted(sentence, hyphen='|| ') for sentence in lyrics]
Save the data into a .txt file. My data looks like this:
Divided by syllables. The two screenshots are not from the same song, but each song follows this logic. I had to cut each part because of the swearwords:
Not divided by syllables (better performance). The two screenshots are not from the same song, but each song follows this logic. I had to cut each part because of the swearwords:
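Putting the punctuation-spacing and save steps together might look like this (the sample lines and file name are placeholders for your own corpus):

```python
import re

# Placeholder lyrics standing in for the real corpus.
lyrics = [
    "Tu dis partout qu't'es une terreur",
    "Tout, tout pour la monnaie brrr,brrr, billets verts",
]

# Space out apostrophes and commas as described above.
cleaned = [re.sub(r"\s*,\s*", " , ", s.replace("'", "' ")) for s in lyrics]

# One line per lyric; this is the file you will upload to Colab.
with open("Booba_lyrics.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned))
```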
C- Train on Google Colab
You don’t have to be an expert data scientist for this part; anyone can do it. Just follow the tutorial step by step. I have broken it down into many very simple points, each of which takes a few seconds. The whole thing takes less than 5 minutes, excluding training time.
If you’ve never used Colab, follow the whole tutorial.
If you know how to use Colab, go to step 8.
Go to Google Colab: https://colab.research.google.com
1- Click on File
2- Click on New Python 3 notebook
3- Click on ok
4- Enter your google email address
5- Click Next and enter your Password
6- You are on your first Colab notebook. Click on No thanks in the blue window.
7- You can rename your notebook at the top left. I renamed it rap-like-booba.ipynb
8- Click on runtime
9- Click on Change runtime type
10- Click on the down arrow next to None
11- Click on GPU
12- Click on Save
13- Copy these lines:
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
Click on the first rectangular cell and paste the lines
14- Press shift and enter simultaneously. Congratulations, you have imported the library needed to train the model
15- Click on the little right arrow next to your first cell (I put a red circle around it)
16- Click on Files (red circle)
17- Click on UPLOAD (red circle)
18- Click on ok
19- If you managed to collect Booba’s lyrics, name your file Booba_lyrics.txt. If not, use Shakespeare’s texts instead, as explained above: download them from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt, copy the text and save it as Shakespeare.txt. Credit goes to https://github.com/karpathy/char-rnn for the Shakespeare data.
Choose your file Booba_lyrics.txt or Shakespeare.txt and upload it
20- In the second rectangular cell, paste this line and press shift and enter to download the pretrained model:
gpt2.download_gpt2(model_name='124M')
21- The model is now downloaded; you now have to fine-tune it on your data!
In the third rectangular cell, copy and paste these lines:
If you have Shakespeare’s data:
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset='Shakespeare.txt',
              model_name='124M',
              steps=2000,
              restore_from='fresh',
              run_name='rap',
              print_every=10,
              sample_every=200,
              save_every=500)
If you have Booba’s data:
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset='Booba_lyrics.txt',
              model_name='124M',
              steps=2000,
              restore_from='fresh',
              run_name='rap',
              print_every=10,
              sample_every=200,
              save_every=500)
22- Press shift and enter. Your model is training now! Soon it will be able to rap!
My file has about 230K words and the model takes about 2 hours to train. If you want to cut that to 1 hour, halve the number of training steps (change steps=2000 to steps=1000 in the finetune call).
The quality of the predictions will be much lower if you do that, though.
During training, your model will rap every 200 steps.
At the beginning, the quality of the prediction is very low because your AI is still learning (rude words are crossed out for the sake of professionalism):
In the end, the predictions are far better. Here are some raps generated while the model was still learning:
Sometimes it raps exactly the same lines as Booba: that’s overfitting.
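One simple way to spot that kind of memorisation is to check generated lines against the training file; this is a sketch of my own (the function name and the sample data are invented for illustration):

```python
def memorised_lines(generated_lines, training_text):
    """Return the generated lines that appear verbatim in the training corpus."""
    training = {line.strip() for line in training_text.splitlines()}
    return [line for line in generated_lines if line.strip() in training]

# Illustrative data: one memorised line, one original line.
training_text = "Tout , tout pour la monnaie\nAucune attache , grosse bague en poche"
generated = ["Tout , tout pour la monnaie", "Une toute nouvelle punchline"]
print(memorised_lines(generated, training_text))  # → ['Tout , tout pour la monnaie']
```

If many samples come back flagged, lowering the number of training steps or adding more data usually helps.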
23- Learning is done; now you can make your BoobaBot rap!
Copy and paste this line and press shift and enter:
gpt2.generate(sess, run_name='rap')
Congratulations, you have created your first Booba-AI!
Look at this punchline, created entirely by the Booba-AI:
“Aucune attache, grosse bague en poche”
Thanks for following the tutorial! Now that you’ve mastered rap-song generation, keep in mind that with great power comes great responsibility. Try your best not to hurt anyone.
@Booba thank you for helping me with this tutorial. You can contact me at firstname.lastname@example.org if you want me to generate lyrics for you. Izi.
Edmond Aouad, Data Scientist at Ponicode