Extending the Massive Text Embedding Benchmark to French: the datasets
Mathieu Ciancone, Imene Kerboua, Marion Schaeffer, Gabriel Sequeira, and Wissam Siblini
Introduction
With the recent boom of natural language applications, the ability to select a method that generates high-quality text representations has become crucial. To help with this, the Massive Text Embedding Benchmark (MTEB) [1] was introduced. It allows the evaluation and comparison of text embedding methods on various NLP tasks and datasets. An embedding is a dense vector representation that captures the semantic meaning of a text and can be used for downstream NLP tasks such as text classification, information retrieval, machine translation, etc. MTEB originally compared 33 different models on 8 different tasks: bitext mining, classification, pair classification, retrieval, reranking, clustering, summarization and semantic textual similarity. Overall, it gathered 58 datasets across tasks, most of them in English.
We extend this work to the French language. The project is available here 👉 https://github.com/Lyon-NLP/mteb-french. In order to compare embeddings obtained from texts in French, we identified 14 relevant datasets and created 3 new ones targeting the set of tasks used in MTEB. Of course, these datasets can also be used for a wide range of other applications.
This first article of a series focuses on presenting these datasets, their characteristics and the goal behind each task. By bringing all the information together in one place, we hope to make it easier to search for French datasets for NLP and encourage the evaluation of French embeddings.
So if you are building an NLP model or application targeting the French language, this article is for you! 😉
Datasets
In MTEB, the evaluation is divided into the 8 distinct tasks mentioned above. We will present the datasets according to these 8 tasks, but keep in mind that a given dataset could potentially be used in other ways.
Bitext Mining
Given two sets of sentences, this task aims to find the best match for each sentence from the first set in the second set. Generally, the second set contains translations of the sentences from the first set. For MTEB evaluation, models are used to embed each sentence and then the closest pairs are found using cosine similarity. The main metric computed for this evaluation is the F1 score.
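The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2-dimensional vectors are made-up toy values rather than real model outputs, the helper names `cosine` and `mine_bitext` are hypothetical, and the sketch reports plain accuracy instead of the F1 score to stay short.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_bitext(source_vecs, target_vecs):
    """For each source vector, return the index of the most similar target vector."""
    return [
        max(range(len(target_vecs)), key=lambda j: cosine(src, target_vecs[j]))
        for src in source_vecs
    ]

# Toy "embeddings" (hypothetical values, not produced by a real model).
english = [[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]]
french = [[0.2, 0.9], [0.9, 0.2], [0.6, 0.8]]
gold = [1, 0, 2]  # index of the true translation of each English sentence

predicted = mine_bitext(english, french)
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(predicted, accuracy)
```

In a real run, the toy vectors would be replaced by the embeddings a model produces for each sentence of the two sets.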
DiaBLa: We use the DiaBLa dataset, already available on HuggingFace. This dataset contains a set of informal written dialogues for evaluating English-French machine translation on informal texts [2]. The dataset contains over 5,700 text pairs extracted from 144 dialogues.
Link to dataset: https://huggingface.co/datasets/rbawden/DiaBLa
Flores: Flores is a benchmark dataset for machine translation between English and low-resource languages [3,4,5]. We use the French subset of this dataset, in other words, the English-to-French translated texts. It contains 997 samples.
Link to dataset: https://huggingface.co/datasets/facebook/flores
Classification
For the classification task, we evaluate which embedding models are best suited for the task of identifying to which class a sentence belongs based on its vector representation. To that end, a model is used to embed a train and a test set. Then, a logistic regression classifier is trained on the train set and evaluated on the test set with the Accuracy metric.
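As a rough illustration of this protocol, the sketch below fits a tiny logistic regression on toy embeddings and computes accuracy on a test set. It is a simplified stand-in, not the MTEB implementation: it handles only two classes, uses hand-written gradient descent, and the vectors, labels and function names (`train_logreg`, `predict`) are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression fitted with batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            # Prediction error for this sample.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for d in range(dim):
                grad_w[d] += err * x[d] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return class 1 when the linear score is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy embeddings (hypothetical values) for a train and a test set.
train_X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
train_y = [1, 1, 0, 0]
test_X = [[1.8, 0.2], [0.3, 1.7]]
test_y = [1, 0]

w, b = train_logreg(train_X, train_y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(test_X, test_y)) / len(test_y)
print(accuracy)
```

The embedding model is never fine-tuned here: only the small classifier on top of the frozen vectors is trained, so the accuracy reflects how linearly separable the classes are in the embedding space.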
Amazon Review: We use the French subset of Amazon Review. The dataset contains 200k samples in the train set, and 5k samples each in the validation and test sets. The text of the product reviews is classified according to the associated rating between 0 and 4.
Link to dataset: https://huggingface.co/datasets/mteb/amazon_reviews_multi/viewer/fr
MasakhaNEWS: MasakhaNEWS is a public dataset for news topic classification, containing news articles in 16 languages spoken in Africa [6]. We use the French subset of this dataset, which contains over 1480 samples in the train set, 211 in the validation set, and 422 samples in the test set. The dataset provides each text with its label (among 5 distinct classes: sports, business, etc.). Samples are equally distributed over 4 classes (23% of the samples each), but the last class is under-represented (5% of the samples).
Link to dataset: https://huggingface.co/datasets/masakhane/masakhanews/viewer/fra
Massive Intent: We use the French subset of the Amazon Massive Intent dataset [7,8]. The dataset is about detecting a user's intent from a given sentence. The data was collected from the usage of virtual assistants such as Alexa. It contains over 11,500 samples in the train set, 2030 samples in the validation set, and 2970 samples in the test set.
Link to dataset: https://huggingface.co/datasets/mteb/amazon_massive_intent/viewer/fr
MTOP: As above, we selected a subset of the Multilingual Task-Oriented Semantic Parsing (MTOP) dataset [9]. It contains over 11,800 samples in the train set, 1580 samples in the validation set, and 3190 samples in the test set. The dataset comes in two copies, one with 10 labels (mtop_domain) and the other with 111 labels (mtop_intent).
Links to datasets: https://huggingface.co/datasets/mteb/mtop_intent, https://huggingface.co/datasets/mteb/mtop_domain
Pair classification
In this task, a pair of sentences is given with an additional label denoting if the pair is a duplicate or a paraphrase. Both sentences are embedded using a model and the distance between them is computed using several distance metrics such as cosine similarity, euclidean distance, etc. The evaluation metric for this task is average precision based on the cosine similarity.
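To make the metric concrete, here is a minimal sketch of average precision computed from similarity scores and binary paraphrase labels. The scores and labels are invented for illustration, the function name `average_precision` is hypothetical, and ties between scores are not given any special treatment.

```python
def average_precision(scores, labels):
    """Average precision of binary labels ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision over the first `rank` pairs, taken at each positive.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical cosine similarities for four sentence pairs,
# with 1 meaning "paraphrase" and 0 meaning "not a paraphrase".
similarities = [0.95, 0.80, 0.60, 0.30]
is_paraphrase = [1, 0, 1, 0]

ap = average_precision(similarities, is_paraphrase)
print(round(ap, 4))
```

A model whose similarities rank every true paraphrase above every non-paraphrase would reach an average precision of 1.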
Opusparcus: This dataset is a paraphrase corpus for six languages, where the paraphrases are subtitles from movies and TV shows [10]. We evaluate embeddings on the test.full and validation.full splits of the French subset. These splits contain 1670 and 1630 samples respectively.
Link to dataset: https://huggingface.co/datasets/GEM/opusparcus/viewer/fr.100
Retrieval
Given a query, the retrieval task aims to find the most relevant documents (often paragraphs) in a corpus of documents, using the cosine similarity between vectors. Benchmarking models on this task is particularly interesting because of their use in Retrieval Augmented Generation (RAG) pipelines. Several metrics are used to evaluate this task, the main one being Normalized Discounted Cumulative Gain (NDCG@10).
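The main metric can be sketched as follows. This is a simplified version under two assumptions: relevance is used directly as the gain, and the input list holds the relevance of every judged document in the order the model ranked them, so that the ideal ordering can be derived from the same list. The function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items of a ranked list."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the model ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Relevance (1 = relevant, 0 = irrelevant) of documents in ranked order.
perfect = ndcg_at_k([1, 1, 0, 0])
imperfect = ndcg_at_k([0, 1, 0, 1])
print(perfect, round(imperfect, 4))
```

The logarithmic discount means that a relevant document found at rank 1 counts more than the same document found at rank 5, which is why the second ranking scores lower than the first.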
AlloProf: This question-answering dataset is collected from Alloprof, a Quebec-based primary and high-school help website. It contains almost 30k question-answer pairs, spread over a variety of school subjects [11]. More than half of the answers also contain a link to a reference page addressing the question’s topic. The dataset has been cleaned and formatted for the retrieval task: only French questions mentioning a reference page were kept, and all reference pages have been consolidated as a corpus dataset.
Links to datasets: https://huggingface.co/datasets/lyon-nlp/alloprof (formatted), https://huggingface.co/datasets/antoinelb7/alloprof (original)
Syntec: This dataset has been built from the Syntec collective bargaining agreement, with the initial purpose of being used for the retrieval task. This rather small dataset is split into 2 subsets: 100 manually created questions mapped to the article which contains the answer, and the 90 articles from the collective bargaining agreement.
Link to dataset: https://huggingface.co/datasets/lyon-nlp/mteb-fr-retrieval-syntec-s2p
BSARD: The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native dataset initially built for information retrieval in the legal domain [12]. It consists of 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles among more than 22,600 statutory articles from Belgian law. It is a particularly difficult dataset for models that are not specialized in the legal domain.
Link to dataset: https://huggingface.co/datasets/maastrichtlawtech/bsard
Reranking
The goal of the reranking task is to sort a small set of documents in terms of relevance regarding a given query. The reranking task is often used in recommender systems, or as a complement to the retrieval task. In the context of MTEB, the aim is to evaluate the models' ability to produce embeddings that yield a cosine similarity correlated with the document's relevance to the question.
To evaluate this task, each dataset of the original MTEB benchmark is composed of queries, each paired with a few positive (i.e. relevant) documents along with negative (i.e. irrelevant) documents. Despite our efforts, we didn’t find any relevant French dataset structured this way. Consequently, we decided to build our own using the AlloProf and Syntec retrieval datasets. These already have queries and positive documents, so we applied the following process to generate the negatives. The corpus of documents and the queries were embedded using an embedding model. Then, we computed the cosine similarity between every document and every query. For each query, documents that do not rank among its 10 most similar documents were labeled as its negatives.
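The negative-generation process can be sketched as below. This is an illustration rather than the exact script used for the datasets: the function name `mine_negatives` and the toy vectors are hypothetical, the example uses a cut-off of 2 instead of 10 so that it fits a four-document corpus, and excluding known positives from the negatives is an added safeguard that the description above does not mention.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_negatives(query_vec, doc_vecs, positive_ids, top_k=10):
    """Return the ids of documents ranked outside the top_k most similar ones."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    # Assumption: a known positive is never labeled as a negative.
    return [doc_id for doc_id in ranked[top_k:] if doc_id not in positive_ids]

# Toy corpus embeddings (hypothetical values) keyed by document id.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
query = [1.0, 0.0]

negatives = mine_negatives(query, docs, positive_ids={"a"}, top_k=2)
print(negatives)
```

Skipping the most similar documents reduces the risk of labeling as negative a document that is actually relevant but was never annotated as such.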
AlloProf: We adapted the Alloprof dataset with the technique presented above to fit the reranking task. For more information about this dataset, please refer to its description in the Retrieval section.
Link to dataset: https://huggingface.co/datasets/lyon-nlp/mteb-fr-reranking-alloprof-s2p
Syntec: As described above, this dataset’s structure has been modified to fit the reranking task. For more information about this dataset, please refer to its description in the Retrieval section.
Link to dataset: https://huggingface.co/datasets/lyon-nlp/mteb-fr-reranking-syntec-s2p
Clustering
This task tries to group sentences or paragraphs into meaningful clusters. To do that, the texts are embedded and a k-means model is applied with the known number of clusters.
The metric used to score the model is the V-measure, which is invariant to permutations of the cluster labels: only the grouping matters, not the identifier assigned to each cluster.
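For readers unfamiliar with the V-measure, the sketch below computes it as the harmonic mean of homogeneity and completeness, both derived from entropies. The function names and the toy labels are hypothetical, and the k-means step itself is left out: the cluster ids are assumed to be given.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """Entropy of `labels` once the value of `given` is known."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )

def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = entropy(true_labels), entropy(cluster_ids)
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

topics = ["sports", "sports", "business", "business"]
same_grouping = v_measure(topics, [0, 0, 1, 1])
renamed_clusters = v_measure(topics, [1, 1, 0, 0])
mixed_grouping = v_measure(topics, [0, 1, 0, 1])
print(same_grouping, renamed_clusters, mixed_grouping)
```

The first two calls return the same score even though the cluster identifiers are swapped, which illustrates the permutation invariance mentioned above.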
AlloProf: For this task, documents can be clustered into different topics based on their textual description (fields text and title). For more information about this dataset, please refer to its description in the Retrieval section.
Link to dataset: https://huggingface.co/datasets/lyon-nlp/alloprof
HAL: This dataset was built by scraping https://hal.science/ where scientific publications in various fields are published. We only kept publications in French and extracted their id, title and domain. In total, 85,000 publications can be clustered according to their subject area (field domain) by their title.
Link to dataset: https://huggingface.co/datasets/lyon-nlp/clustering-hal-s2s
MasakhaNEWS: We reuse this multilingual news topic classification dataset for the clustering task. We filtered the dataset to only keep the French subset of the test set. In total, 1500 news articles can be clustered according to their topic (field label) by their textual description (fields text and headline). For more information about this dataset, please refer to its description in the Classification section.
Link to dataset: https://huggingface.co/datasets/masakhane/masakhanews/viewer/fra
MLSUM: We use the Multilingual Summarization Corpus (MLSUM) [13] of online newspapers for the clustering task. We filtered the dataset to only keep the French subset of the test set. In total, 15,800 online newspaper articles can be clustered according to their topic by their textual description (fields text and title).
Link to dataset: https://huggingface.co/datasets/mlsum/viewer/fr/test
Summarization
This task aims to score a machine-generated summary based on its similarity with human-written summaries. To do that, all summaries are embedded and distances between the machine-generated and human-written summaries are computed. The highest cosine similarity score is kept as the machine-generated summary score. Pearson and Spearman correlations with ground truth human assessments are used to evaluate the score computed. This task is close to the STS one.
SummEval: This dataset consists of 100 news articles from the CNN/DailyMail dataset [14]. Each of these news articles comes with 10 human-written summaries and 16 machine-generated summaries annotated by 8 annotators across coherence, consistency, fluency, and relevance. As this dataset is only available in English, we translated it into French using DeepL. Human-written and machine-generated summaries are embedded and compared with cosine similarity, and the average relevance of expert annotations is used as the ground truth assessment.
Link to dataset: https://huggingface.co/datasets/lyon-nlp/summarization-summeval-fr-p2p
Semantic Textual Similarity (STS)
This task aims to compute the similarity between two sentences and give a continuous score. Here, pairs of sentences are labeled with a score between 1 and 5 (the higher the score, the more similar the sentences). The score generated from embeddings is compared with the labeled score using Pearson and Spearman correlations.
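The two correlations can be sketched in plain Python as follows. The gold scores and predicted similarities are invented for illustration and the function names are hypothetical. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical human scores and cosine similarities for four sentence pairs.
gold_scores = [0.5, 2.0, 3.5, 5.0]
predicted_similarities = [0.10, 0.40, 0.35, 0.90]

rho = spearman(gold_scores, predicted_similarities)
r = pearson(gold_scores, predicted_similarities)
print(round(rho, 4), round(r, 4))
```

Spearman only looks at the ordering of the pairs, so it does not penalize a model whose cosine similarities live on a different scale than the human scores.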
STS Benchmark Multilingual: This dataset consists of a mix of several English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. It includes text from image captions, news headlines and user forums as pairs of sentences and a score of their similarity. The English version was translated into French with DeepL. We use the 1379 samples of the test set.
Link to dataset: https://huggingface.co/datasets/stsb_multi_mt
STS22 Crosslingual: This dataset consists of pairs of news articles, with their similarity labeled with a score between 0 and 5. It comprises 10 languages. For the evaluation of models on French, we only use the French subset, which is composed of 104 article pairs.
Link to dataset: https://huggingface.co/datasets/mteb/sts22-crosslingual-sts/viewer/fr
SICK-FR: The Sentences Involving Compositional Knowledge (SICK) dataset consists of about 10,000 English sentence pairs that include many examples of lexical, syntactic and semantic phenomena. Each pair is labeled with both a "relatedness in meaning" score (with a 5-point rating scale) and an entailment relation (with three possible gold labels: entailment, contradiction, and neutral). For the purpose of this benchmark, we use SICK-FR, a French translation of SICK, along with the "relatedness in meaning" score.
Link to dataset: https://huggingface.co/datasets/Lajavaness/SICK-fr
Conclusion
We started this initiative after realizing that it was often difficult to select the right NLP method for French applications. Of course, there are many good multilingual models out there. But when looking closely at their training process, it appears that most of the training data is actually in English. And since the benchmarks evaluating these models are also in English, their performance in French is hard to assess.
One of the reasons for this might be the lack of good-quality French datasets. Indeed, many datasets in French are either too specialized in a specific domain to be used in a benchmark or do not come in a "ready-to-use" format and need significant cleaning and formatting work.
Identifying and preparing relevant French datasets to be used for the MTEB-French was not a trivial task, and we hope that this work can help the community accelerate the evaluation of models. The next step for MTEB-French implementation is to identify relevant models to evaluate.
The models chosen, along with justification for their selection, will be the topic of the next article. Stay tuned! 😎
Bibliography
[1] Muennighoff, Niklas et al. “MTEB: Massive Text Embedding Benchmark.” Conference of the European Chapter of the Association for Computational Linguistics (2022).
[2] Bawden, Rachel et al. “DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation.” Language Resources and Evaluation 55 (2019): 635 - 660.
[3] NLLB Team et al. “No Language Left Behind: Scaling Human-Centered Machine Translation.” ArXiv abs/2207.04672 (2022): n. pag.
[4] Goyal, Naman et al. “The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation.” Transactions of the Association for Computational Linguistics 10 (2021): 522-538.
[5] Guzmán, Francisco et al. “Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English.” ArXiv abs/1902.01382 (2019): n. pag.
[6] Adelani, David Ifeoluwa et al. “MasakhaNEWS: News Topic Classification for African languages.” ArXiv abs/2304.09972 (2023): n. pag.
[7] FitzGerald, Jack G. M. et al. “MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages.” Annual Meeting of the Association for Computational Linguistics (2023).
[8] Bastianelli, Emanuele et al. “SLURP: A Spoken Language Understanding Resource Package.” Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).
[9] Xia, Menglin and Emilio Monti. “Multilingual Neural Semantic Parsing for Low-Resourced Languages.” Conference on Lexical and Computational Semantics (2021).
[10] Creutz, Mathias. “Open Subtitles Paraphrase Corpus for Six Languages.” Conference on Language Resources and Evaluation (LREC 2018).
[11] Lefebvre-Brossard, Antoine et al. “Alloprof: a new French question-answer education dataset and its use in an information retrieval case study.” ArXiv abs/2302.07738 (2023): n. pag.
[12] Louis, Antoine et al. “A Statutory Article Retrieval Dataset in French.” Annual Meeting of the Association for Computational Linguistics (2021).
[13] Scialom, Thomas et al. “MLSUM: The Multilingual Summarization Corpus.” Conference on Empirical Methods in Natural Language Processing (2020).
[14] Fabbri, A. R. et al. “SummEval: Re-evaluating Summarization Evaluation.” Transactions of the Association for Computational Linguistics 9 (2020): 391-409.