# Fast Inference with CTranslate2
Speed up inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.
Quantized version of jinaai/jina-embedding-t-en-v1.
pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1"
# from transformers import AutoTokenizer
model_name = "michaelfeil/ct2fast-jina-embedding-t-en-v1"
model_name_orig="jinaai/jina-embedding-t-en-v1"
from hf_hub_ctranslate2 import EncoderCT2fromHfHub
model = EncoderCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    max_length=64,
)  # perform downstream tasks on outputs
outputs["pooler_output"]
outputs["last_hidden_state"]
outputs["attention_mask"]
# alternative, use SentenceTransformer Mix-In
# for end-to-end Sentence embeddings generation
# (not pulling from this CT2fast-HF repo)
from hf_hub_ctranslate2 import CT2SentenceTransformer
model = CT2SentenceTransformer(
    model_name_orig, compute_type="int8_float16", device="cuda"
)
embeddings = model.encode(
    ["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    batch_size=32,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
print(embeddings.shape, embeddings)
scores = (embeddings @ embeddings.T) * 100
# Hint: you can also host this code via REST API and
# via github.com/michaelfeil/infinity
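Because `normalize_embeddings=True` returns unit-length vectors, the product `embeddings @ embeddings.T` is the matrix of pairwise cosine similarities, here scaled by 100. The minimal sketch below checks that identity in plain Python. The three toy vectors and the helper names `normalize`, `dot`, and `cosine` are illustrative assumptions, not output or API of the model.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for three sentence embeddings (not real model output).
raw = [[1.0, 2.0, 0.5], [0.9, 2.1, 0.4], [-1.0, 0.2, 3.0]]
unit = [normalize(v) for v in raw]

# Equivalent of `(embeddings @ embeddings.T) * 100` for unit-length rows.
scores = [[100 * dot(a, b) for b in unit] for a in unit]

for i, row in enumerate(scores):
    print(i, [round(s, 1) for s in row])
```

Each diagonal entry is 100 (a vector compared with itself), and every off-diagonal entry equals 100 times the cosine similarity of the corresponding pair.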
Checkpoint compatible with ctranslate2>=3.17.1 and hf-hub-ctranslate2>=2.12.0:
- compute_type=int8_float16 for device="cuda"
- compute_type=int8 for device="cpu"
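The device-to-precision pairing in the compatibility note above can be captured in a small lookup. This is only a sketch: `pick_compute_type` is a hypothetical helper, not part of hf-hub-ctranslate2, and it covers just the two devices listed in the note.

```python
def pick_compute_type(device: str) -> str:
    """Return the compute_type listed in the compatibility note for a device."""
    # Only the two pairings from the note are known; anything else is rejected.
    recommended = {"cuda": "int8_float16", "cpu": "int8"}
    try:
        return recommended[device]
    except KeyError:
        raise ValueError(f"no recommended compute_type for device {device!r}") from None

print(pick_compute_type("cuda"))
print(pick_compute_type("cpu"))
```

The result can then be passed along with the device, for example `EncoderCT2fromHfHub(model_name_or_path=model_name, device=device, compute_type=pick_compute_type(device))`.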
Converted on 2023-10-13 using
LLama-2 -> removed <pad> token.
Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.
Original description
The text embedding set trained by Jina AI, Finetuner team.
Intended Usage & Model Info
jina-embedding-t-en-v1 is a tiny language model that has been trained using Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With just 14 million parameters, the model enables lightning-fast inference on CPU while still delivering impressive performance. Additionally, we provide the following options:
- jina-embedding-t-en-v1: 14 million parameters (you are here).
- jina-embedding-s-en-v1: 35 million parameters.
- jina-embedding-b-en-v1: 110 million parameters.
- jina-embedding-l-en-v1: 330 million parameters.
- jina-embedding-1b-en-v1: 1.2 billion parameters, 10 times bert-base (soon).
- jina-embedding-6b-en-v1: 6 billion parameters, 30 times bert-base (soon).
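As a rough back-of-the-envelope illustration of what the parameter counts listed above mean for weight storage, the sketch below multiplies each count by the bytes per parameter for float32 (4), float16 (2), and int8 (1). It assumes every weight is stored at the given precision and ignores activations, quantization scales, and runtime overhead, so real figures will differ. The helper name `weight_megabytes` is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Parameter counts as listed above (released models only).
PARAMS = {
    "jina-embedding-t-en-v1": 14_000_000,
    "jina-embedding-s-en-v1": 35_000_000,
    "jina-embedding-b-en-v1": 110_000_000,
    "jina-embedding-l-en-v1": 330_000_000,
}

def weight_megabytes(n_params, dtype):
    """Approximate weight storage in MB (10^6 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6

for name, n in PARAMS.items():
    sizes = {d: weight_megabytes(n, d) for d in BYTES_PER_PARAM}
    print(f"{name}: " + ", ".join(f"{d}={mb:.0f} MB" for d, mb in sizes.items()))
```

Under these assumptions the 14 million parameter model needs about 56 MB in float32 and about 14 MB in int8, a 4x reduction. Starting from float16 the reduction is 2x.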
Data & Parameters
Please check out our technical blog.
Metrics
We compared the model against all-minilm-l6-v2 and all-mpnet-base-v2 from sbert and text-embedding-ada-002 from OpenAI:
Name | param | dimension |
---|---|---|
all-minilm-l6-v2 | 23m | 384 |
all-mpnet-base-v2 | 110m | 768 |
ada-embedding-002 | Unknown/OpenAI API | 1536 |
jina-embedding-t-en-v1 | 14m | 312 |
jina-embedding-s-en-v1 | 35m | 512 |
jina-embedding-b-en-v1 | 110m | 768 |
jina-embedding-l-en-v1 | 330m | 1024 |
Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
---|---|---|---|---|---|---|---|---|---|
all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 |
ada-embedding-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 |
jina-embedding-t-en-v1 | 0.717 | 0.773 | 0.731 | 0.829 | 0.777 | 0.860 | 0.482 | 0.840 | 0.522 |
jina-embedding-s-en-v1 | 0.743 | 0.786 | 0.738 | 0.837 | 0.80 | 0.875 | 0.523 | 0.857 | 0.524 |
jina-embedding-b-en-v1 | 0.751 | 0.809 | 0.761 | 0.856 | 0.812 | 0.890 | 0.606 | 0.876 | 0.594 |
jina-embedding-l-en-v1 | 0.745 | 0.832 | 0.781 | 0.869 | 0.837 | 0.902 | 0.573 | 0.881 | 0.598 |
Inference Speed
We encoded a single sentence "What is the current weather like today?" 10k times on:
- cpu: MacBook Pro 2020, 2 GHz Quad-Core Intel Core i5
- gpu: 1 Nvidia 3090
We then recorded the time spent to demonstrate the embedding speed:
Name | param | dimension | time@cpu | time@gpu |
---|---|---|---|---|
jina-embedding-t-en-v1 | 14m | 312 | 5.78s | 2.36s |
all-minilm-l6-v2 | 23m | 384 | 11.95s | 2.70s |
jina-embedding-s-en-v1 | 35m | 512 | 17.25s | 2.81s |
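The timings above can be turned into throughput by dividing the 10k repetitions by the elapsed seconds. The sketch below does that arithmetic and also shows one possible shape for such a timing loop. The helper names `time_repeated` and `throughput` are hypothetical, and the lambda is a dummy stand-in for a real encoder so the snippet runs without downloading a model; it is not the benchmark script used for the table.

```python
import time

def time_repeated(encode, sentence, repeats):
    """Total wall-clock seconds for calling encode(sentence) `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        encode(sentence)
    return time.perf_counter() - start

def throughput(repeats, seconds):
    """Sentences encoded per second."""
    return repeats / seconds

# Reported times for 10k encodes, copied from the table above.
REPORTED = {
    "jina-embedding-t-en-v1": {"cpu": 5.78, "gpu": 2.36},
    "all-minilm-l6-v2": {"cpu": 11.95, "gpu": 2.70},
    "jina-embedding-s-en-v1": {"cpu": 17.25, "gpu": 2.81},
}
for name, times in REPORTED.items():
    print(name, {dev: round(throughput(10_000, t)) for dev, t in times.items()})

# Dummy encoder standing in for model.encode; replace with a real model to benchmark.
elapsed = time_repeated(
    lambda s: [float(len(s))], "What is the current weather like today?", 1_000
)
print(f"dummy encoder: {elapsed:.4f}s for 1000 calls")
```

Read this way, 5.78s on CPU corresponds to roughly 1,730 sentences per second for jina-embedding-t-en-v1, about twice the CPU throughput of all-minilm-l6-v2 in the same table.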
Usage
Use with Jina AI Finetuner
!pip install finetuner
import finetuner
model = finetuner.build_model('jinaai/jina-embedding-t-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?'],
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['how is the weather today', 'What is the current weather like today?']
model = SentenceTransformer('jinaai/jina-embedding-t-en-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
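Once embeddings are available, retrieval and reranking reduce to sorting candidates by cosine similarity to the query. The following is a minimal sketch of that step in plain Python. The 3-dimensional vectors are toy stand-ins for `model.encode(...)` output, and `cosine_similarity` and `rank` are hypothetical helpers, not functions from sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy vectors standing in for document and query embeddings.
docs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.7, 0.3, 0.1]]
query = [1.0, 0.2, 0.0]

for idx, score in rank(query, docs):
    print(idx, round(score, 3))
```

With real embeddings, `docs` would hold the encoded documents and `query` the encoded search string; the ranking logic stays the same.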
Fine-tuning
Please consider Finetuner.
Plans
- The development of jina-embedding-s-en-v2 is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
- We are currently working on a bilingual embedding model that combines English and a second language X. The upcoming model will be called jina-embedding-s/b/l-de-v1.
Contact
Join our Discord community and chat with other community members about ideas.
Citation
If you find Jina Embeddings useful in your research, please cite the following paper:
@misc{günther2023jina,
title={Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models},
author={Michael Günther and Louis Milliken and Jonathan Geuter and Georgios Mastrapas and Bo Wang and Han Xiao},
year={2023},
eprint={2307.11224},
archivePrefix={arXiv},
primaryClass={cs.CL}
}