Multilingual E5 Text Embeddings: A Technical Report
Abstract
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a trade-off between inference efficiency and embedding quality. The training procedure follows the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model whose performance is on par with state-of-the-art English-only models of similar size. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
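The abstract does not state which loss the contrastive pre-training stage uses. As a rough illustration only, the sketch below assumes an InfoNCE-style objective with in-batch negatives, a common choice for training embeddings on text pairs. The function names, the toy 2-dimensional vectors, and the temperature of 0.05 are all hypothetical and are not taken from the report.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch serves as a negative (in-batch negatives).
    The temperature value is an illustrative assumption.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy stand-ins for encoder outputs (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive sits close to its query
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives and negatives exchanged

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

Under these assumptions the loss is small when each query is closest to its own paired passage and large when the pairing is swapped, which is the signal that pulls paired texts together in embedding space.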
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Improving Text Embeddings with Large Language Models (2023)
- Nomic Embed: Training a Reproducible Long Context Text Embedder (2024)
- BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation (2024)
- JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report (2023)
- Adapting Large Language Models for Document-Level Machine Translation (2024)
@intfloat
Thank you for your work!
When training "mE5-large-instruct", did you use only the synthetic data, synthetic + msmarco, or synthetic + full data? (I am referring to the notation introduced in "Improving Text Embeddings with Large Language Models".)
It is the "synthetic data + full data" setting, the same data mixture as the released e5-mistral-7b-instruct model.