
SageLite-l

Model Description

SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:

  1. MLM Pretraining: Standard masked language model (MLM) pretraining on mixed code and text data (The-Stack-v2 and Falcon-refinedweb).
  2. Contrastive Pre-Finetuning: Learning from a large number of positive pairs mined from web data and GitHub.
  3. Contrastive Fine-Tuning: Fine-tuning on a small amount of synthetic data.
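
The two contrastive stages above both learn from positive pairs. This card does not state the exact objective, so the following is only a minimal sketch of one common choice: an InfoNCE loss with in-batch negatives. The function names, the temperature value, and the toy vectors are assumptions made for illustration, not details of SageLite's training.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (illustrative only).

    queries[i] is paired with positives[i]; every other positives[j]
    in the batch acts as a negative for queries[i].
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch: two orthogonal 2-dimensional "embeddings"
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce_loss(aligned, aligned)        # each query matches its positive
loss_swapped = info_nce_loss(aligned, aligned[::-1])  # positives deliberately mismatched
```

The loss is close to zero when each query sits nearest its own positive and grows when the pairs are mismatched, which is the signal that pulls paired code and text together in embedding space.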

Training Data

This checkpoint was trained on both The-Stack-v2 and Falcon-refinedweb. Supported languages (10 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.


How to Use

This checkpoint consists of an encoder (850M parameters) that extracts code embeddings of 1536 dimensions, matching the SageLite-l rows in the tables under Code Retrieval Performance. It can be loaded using the Hugging Face Transformers library and uses the StarCoder tokenizer.

import torch
from transformers import AutoModel, AutoTokenizer

# Specify the checkpoint
checkpoint = "SageLite/SageLite-l"
# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model (trust_remote_code is needed because the repo ships custom model code)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
model.eval()  # Inference mode: disables dropout

# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
with torch.no_grad():  # No gradients are needed to extract embeddings
    embedding = model(inputs)[0]  # Extract the embedding
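
Once embeddings are extracted, retrieval usually comes down to ranking candidates by similarity to a query. The sketch below assumes each embedding has already been converted to a flat list of floats, and it uses cosine similarity, a common choice that this card does not explicitly prescribe. The helper names and the toy 3-dimensional vectors are hypothetical stand-ins for real model outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings
query = [0.9, 0.1, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # nearly orthogonal to the query
    [1.0, 0.0, 0.0],  # almost the same direction as the query
    [0.5, 0.5, 0.0],  # in between
]
ranking = rank_candidates(query, candidates)
```

For large candidate pools, the same ranking is normally computed as a single matrix product over normalized embeddings instead of a Python loop.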

Code Retrieval Performance

1. Code2Code Search

| Model Name | # Params | Embedding Dim | Python | Java | JS | TS | C# | C | Ruby | PHP | Go | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
| OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |

2. NL2Code Search

| Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PHP | Go | Ruby | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |

Text Retrieval Performance (MTEB Retrieval)

| Dataset | SageLite-s | SageLite-l |
|---|---|---|
| ArguAna | 57.75 | 60.71 |
| CQADupstackWordpressRetrieval | 32.42 | 38.63 |
| FiQA2018 | 34.85 | 46.73 |
| NFCorpus | 29.97 | 33.70 |
| QuoraRetrieval | 85.35 | 87.50 |
| SCIDOCS | 18.99 | 21.38 |
| SciFact | 68.43 | 69.05 |
| Touche2020 | 24.41 | 21.43 |
| TRECCOVID | 70.88 | 76.08 |
| FEVER | 71.72 | 73.64 |
| HotpotQA | 58.81 | 62.96 |
| NQ | 48.26 | 54.48 |
| DBPedia | 34.83 | 40.69 |
| ClimateFEVER | 25.69 | 26.20 |
| MSMARCO | 35.01 | 36.55 |
| Average | 46.49 | 49.98 |
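
As a quick sanity check, the average row can be reproduced as the unweighted mean of the 15 per-dataset scores listed above. The snippet below is only an arithmetic illustration; the variable and function names are made up for this example.

```python
# Per-dataset scores copied from the MTEB Retrieval results above, in row order
sagelite_s = [57.75, 32.42, 34.85, 29.97, 85.35, 18.99, 68.43, 24.41,
              70.88, 71.72, 58.81, 48.26, 34.83, 25.69, 35.01]
sagelite_l = [60.71, 38.63, 46.73, 33.70, 87.50, 21.38, 69.05, 21.43,
              76.08, 73.64, 62.96, 54.48, 40.69, 26.20, 36.55]


def mean_score(scores):
    """Unweighted arithmetic mean, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)


avg_s = mean_score(sagelite_s)  # matches the reported 46.49
avg_l = mean_score(sagelite_l)  # matches the reported 49.98
```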
