Quick Tour
Text Embeddings
The easiest way to get started with TEI is to use one of the official Docker containers (see Supported models and hardware to choose the right container).
After making sure that your hardware is supported, install the NVIDIA Container Toolkit if you plan to use GPUs. The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.
Next, install Docker following their installation instructions.
Finally, deploy your model. Let’s say you want to use BAAI/bge-large-en-v1.5. Here’s how you can do this:
model=BAAI/bge-large-en-v1.5
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model
We also recommend sharing a volume with the Docker container (volume=$PWD/data) to avoid downloading the weights on every run.
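On first start the container has to download and load the weights, so it may not accept requests immediately. The following is a minimal readiness-poll sketch in Python. It assumes the server answers a GET on a `/health` route with HTTP 200 once the model is loaded; verify that route against the API documentation for your TEI version. `http_probe` and `wait_until_ready` are illustrative helper names, not part of TEI.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str = "http://127.0.0.1:8080/health") -> bool:
    """Return True if a GET on `url` answers with HTTP 200.

    The /health path is an assumption; adjust it to your deployment.
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe=http_probe, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Call `probe` every `interval` seconds until it succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready()` before sending the first request avoids connection errors while the model is still loading. The probe is passed in as a parameter so that it can be swapped for a different check.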
Once you have deployed a model, you can use the embed endpoint by sending requests:
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
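The same request can be sent from Python using only the standard library. This is a sketch rather than an official client: `build_json_request` and `embed` are hypothetical helper names, the base URL matches the `-p 8080:80` port mapping above, and the response is assumed to be JSON containing one embedding vector per input.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # host port mapped by `-p 8080:80`


def build_json_request(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request with a JSON body, mirroring the curl call."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(inputs, base_url: str = BASE_URL):
    """Send `inputs` (a string or a list of strings) to /embed and return the parsed JSON."""
    req = build_json_request("/embed", {"inputs": inputs}, base_url)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the container to be running):
# vectors = embed("What is Deep Learning?")
```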
Re-rankers
Re-ranker models are sequence classification cross-encoder models with a single class that scores the similarity between a query and a text.
See this blogpost by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.
Let’s say you want to use BAAI/bge-reranker-large:
model=BAAI/bge-reranker-large
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model
Once you have deployed a model, you can use the rerank endpoint to rank the similarity between a query and a list of texts:
curl 127.0.0.1:8080/rerank \
-X POST \
-d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
-H 'Content-Type: application/json'
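The sketch below sends the same request from Python and maps the scores back to the original texts. It assumes the response is a JSON array of objects with an `index` field (the position of the text in `texts`) and a `score` field; check this against the output of the curl call on your deployment. `rerank` and `attach_texts` are hypothetical helper names.

```python
import json
import urllib.request


def rerank(query, texts, url="http://127.0.0.1:8080/rerank"):
    """POST the same payload as the curl example and return the parsed JSON."""
    payload = {"query": query, "texts": texts, "raw_scores": False}
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def attach_texts(results, texts):
    """Pair each result with its text, highest score first.

    Assumes each result looks like {"index": <position in texts>, "score": <float>}.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(texts[r["index"]], r["score"]) for r in ranked]


# Example (requires the container to be running):
# texts = ["Deep Learning is not...", "Deep learning is..."]
# print(attach_texts(rerank("What is Deep Learning?", texts), texts))
```

Sorting on the client side means the code does not depend on the order in which the server returns the results.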
Sequence Classification
You can also use classic Sequence Classification models like SamLowe/roberta-base-go_emotions:
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id $model
Once you have deployed the model, you can use the predict endpoint to get the emotions most associated with an input:
curl 127.0.0.1:8080/predict \
-X POST \
-d '{"inputs":"I like you."}' \
-H 'Content-Type: application/json'
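To pick out the strongest emotions from the response, a small helper can sort the predictions by score. This sketch assumes that, for a single input, the endpoint returns a JSON array of objects with `label` and `score` fields; confirm the shape with the curl call above. `predict` and `top_labels` are hypothetical helper names.

```python
import json
import urllib.request


def predict(inputs, url="http://127.0.0.1:8080/predict"):
    """POST `inputs` to /predict and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"inputs": inputs}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def top_labels(predictions, k=3):
    """Return the `k` highest-scoring (label, score) pairs.

    Assumes each prediction looks like {"label": <str>, "score": <float>}.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["label"], p["score"]) for p in ranked[:k]]


# Example (requires the container to be running):
# print(top_labels(predict("I like you.")))
```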
Batching
You can send multiple inputs in a batch. For example, for embeddings:
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":["Today is a nice day", "I like you"]}' \
-H 'Content-Type: application/json'
And for Sequence Classification:
curl 127.0.0.1:8080/predict \
-X POST \
-d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
-H 'Content-Type: application/json'
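When there are more inputs than you want to send in one request, you can split them into several batches on the client side. The sketch below assumes a per-request limit of 32 inputs, which is an illustrative value; use whatever limit your server is configured with. `chunked` and `batch_payloads` are hypothetical helper names.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def batch_payloads(items: Sequence[str], size: int = 32) -> List[dict]:
    """Build one {"inputs": [...]} request body per batch, in input order."""
    return [{"inputs": chunk} for chunk in chunked(items, size)]
```

Each payload can then be posted to the embed endpoint in turn. Because the batches preserve input order, concatenating the responses keeps each result aligned with its input.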
Air-gapped deployment
To deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside the container using a volume.
For example:
# (Optional) create a `models` directory
mkdir models
cd models
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5
# Set the models directory as the volume path
volume=$PWD
# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.6 --model-id /data/gte-base-en-v1.5
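If git-lfs was not active during the clone, the large weight files are left as small text pointer files, and the server cannot load them. The following sketch scans the cloned directory for such leftovers before it is moved into the air-gapped environment. It relies on the Git LFS pointer format, whose files begin with the line `version https://git-lfs.github.com/spec/v1`. `find_lfs_pointers` is a hypothetical helper name.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under `model_dir` that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as fh:
            if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                stubs.append(path)
    return stubs


# Example:
# for stub in find_lfs_pointers("models/gte-base-en-v1.5"):
#     print("not downloaded:", stub)
```

An empty result means that no pointer stubs were found. It does not verify that the weights themselves are complete or uncorrupted.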