Word2vec: questioning the dataset

Following up on my last research about Word2vec: https://blog.felixjely.fr/2021/03/17/word2vec-new-production/

I wanted to go beyond the Wikipedia training data and move further into more vernacular language. So I started building a corpus of French rap lyrics. I used the Genius API to retrieve multiple songs for each artist in the following list:

artists = ["Kaaris", "Booba", "PNL", "Lacrim", "Plk", "Alpha wann", "Rohff", "Koba lad", "Spri noir", "Hamza", "Kalash criminel", "Gazo", "Freeze corleone", "Vald", "Heuss lenfoire", "Damso", "Josman", "Supreme ntm", "Iam", "Oxmo puccino", "Disiz la peste", "Guizmo", "Zola", "MZ", "Nekfeu", "Orelsan", "SCH", "Caballero and jeanjass", "Hornet La frappe", "Laylow", "Ninho", "lartiste", "Niro", "Bosh", "Hatik", "Soprano", "Maitre Gims", "leto", "Youssoupha", "Médine", "Dadju", "Aya Nakamura", "Diams", "13 block", "Maes", "Jul", "Niska", "MHD", "La fouine", "sefyu", "Rimk", "113", "3010", "Dosseh", "Gradur", "Dinos", "Doums", "Georgio", "Jazzy Bazz", "Jok'air", "Kekra", "Kery James", "Ideal J", "Mc solaar", "Mister V", "Népal", "Nemir", "Sneazzy"]

This produced a txt file containing around 20 to 50 songs from each artist:

Download
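For readers who want to rebuild a similar file, here is a minimal sketch of the collection step. It is not the exact script I used: the function names are hypothetical, and the second function assumes the third-party `lyricsgenius` package plus a `GENIUS_TOKEN` environment variable holding a Genius API token. The fetcher is passed in as a parameter so the assembling logic can be tried without any network access.

```python
from typing import Callable, Iterable, List


def build_corpus(artists: Iterable[str],
                 fetch_lyrics: Callable[[str, int], List[str]],
                 max_songs: int = 50) -> str:
    """Concatenate the lyrics of every artist into one text blob.

    `fetch_lyrics(name, max_songs)` is any callable returning a list of
    lyric strings for one artist (hypothetical interface for this sketch).
    """
    chunks = []
    for name in artists:
        try:
            songs = fetch_lyrics(name, max_songs)
        except Exception as err:
            # A failed lookup should not abort the whole run: skip the artist.
            print(f"skipping {name}: {err}")
            continue
        # Drop empty results and trim surrounding whitespace.
        chunks.extend(s.strip() for s in songs if s and s.strip())
    # One blank line between songs.
    return "\n\n".join(chunks)


def fetch_with_lyricsgenius(name: str, max_songs: int) -> List[str]:
    """Assumption: fetch lyrics through the lyricsgenius package."""
    import os
    import lyricsgenius  # third-party, imported lazily on purpose

    genius = lyricsgenius.Genius(os.environ["GENIUS_TOKEN"])
    artist = genius.search_artist(name, max_songs=max_songs)
    return [song.lyrics for song in artist.songs] if artist else []
```

Calling `build_corpus(artists, fetch_with_lyricsgenius)` and writing the returned string to a UTF-8 text file would give a corpus of the same general shape as the one above.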

The main idea was to capture a more “spoken” language than the encyclopedia-like register produced by the Wikipedia-trained model.

It could be used to train a dedicated model for exploring slang (“argotique”) expressions, or be added on top of other data to give a more plural view of words.
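Before any training, the raw text has to be turned into tokenized sentences. The sketch below is one possible way to do that, using only the standard library. The choices in it are assumptions on my side: each lyric line is treated as one sentence, lines that look like section tags (such as `[Refrain]`) are skipped, and contractions like `j'suis` are kept as a single token, since they are exactly the kind of vernacular form this corpus is meant to capture.

```python
import re
from typing import List

# A token is a run of letters/digits (accented letters included),
# optionally joined by inner apostrophes or hyphens.
TOKEN_RE = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")


def lines_to_sentences(text: str) -> List[List[str]]:
    """Turn raw lyrics into a list of token lists, one per lyric line."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and bracketed section tags (assumed format).
        if not line or (line.startswith("[") and line.endswith("]")):
            continue
        tokens = TOKEN_RE.findall(line.lower())
        if tokens:
            sentences.append(tokens)
    return sentences
```

A list of token lists like this is the input shape that gensim's `Word2Vec` accepts through its `sentences` argument, if that library is the one used for training.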