ModernBERT for multi-vector embeddings
I don't know whether this idea holds any weight or whether I simply don't understand how multi-vector embedding models work, but could token embeddings extracted from the layer just before pooling be used effectively for late interaction, and be expected to perform at least as well as ColBERTv2?
Hello,
ModernBERT is a base model, so there is actually no pooling and there is one embedding per token. There is, however, a prediction layer (which outputs a probability distribution over the vocabulary) that needs to be removed in order to use the embeddings directly.
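For reference, here is a minimal sketch (my own illustration, assuming a transformers version that includes ModernBERT support) of what using the embeddings directly looks like: loading the checkpoint with AutoModel instead of AutoModelForMaskedLM drops the prediction head, so the output is one hidden state per token.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")  # encoder only, no MLM head

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state  # (batch, num_tokens, hidden_size): one vector per token
print(token_embeddings.shape)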
Thus, you can indeed use ModernBERT to train a multi-vector embedding model (such as ColBERTv2), but please note that this does not work "out of the box", as the model is only trained to predict masked words, not to do multi-vector retrieval. We actually did train such a model for our experiments (see the ColBERT experiments in the paper). The script used to train that model is available here.
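To make the late-interaction part of the question concrete, here is a rough sketch of ColBERT-style MaxSim scoring over per-token embeddings; maxsim_score is a hypothetical helper, and without retrieval fine-tuning the scores it produces would not be meaningful.

import torch
import torch.nn.functional as F

def maxsim_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    # query_embeddings: (num_query_tokens, hidden_size)
    # doc_embeddings:   (num_doc_tokens, hidden_size)
    q = F.normalize(query_embeddings, dim=-1)
    d = F.normalize(doc_embeddings, dim=-1)
    similarity = q @ d.T  # cosine similarities, (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum over query tokens.
    return similarity.max(dim=-1).values.sum()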
And more broadly, if you are interested in training multi-vector models, you can do so with any base model using PyLate.
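As a rough sketch of what that looks like (parameter names are from my reading of the PyLate docs, so double-check them against the current API):

from pylate import models

# Wrap the base checkpoint as a ColBERT-style multi-vector model;
# it still needs retrieval fine-tuning before the scores are useful.
model = models.ColBERT(model_name_or_path="answerdotai/ModernBERT-base")

# One embedding per token, projected to the ColBERT output dimension.
query_embeddings = model.encode(["what is late interaction?"], is_query=True)
document_embeddings = model.encode(["Late interaction compares token embeddings at query time."], is_query=False)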
Thank you! I have been trying to get the tokenizer to return offsets_mapping, but it does not seem to be possible. Is it because ModernBERT doesn't support it, or is the problem on the side of the transformers library?
In any case, is there a workaround one might use to get this? I am trying to do late chunking.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
tokenized = tokenizer("Hello, my dog is cute", return_offsets_mapping=True)
print(tokenized)
print(tokenizer.convert_ids_to_tokens(tokenized["input_ids"]))
returns
{'input_ids': [50281, 12092, 13, 619, 4370, 310, 20295, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 5), (5, 6), (6, 9), (9, 13), (13, 16), (16, 21), (0, 0)]}
['[CLS]', 'Hello', ',', 'Ġmy', 'Ġdog', 'Ġis', 'Ġcute', '[SEP]']
So it seems correct to me.
What exactly is failing in your case?
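In case it helps with the late chunking itself, here is a rough sketch (my own illustration, with hard-coded chunk boundaries) of how offset_mapping can be used to pool full-context token embeddings into chunk embeddings:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

text = "Hello, my dog is cute. It loves long walks."
chunk_spans = [(0, 22), (23, len(text))]  # hypothetical character spans of two chunks

encoded = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
offsets = encoded.pop("offset_mapping")[0]  # (num_tokens, 2); (0, 0) for special tokens

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state[0]  # (num_tokens, hidden_size)

# Late chunking: encode the whole text once, then mean-pool token embeddings per chunk.
chunk_embeddings = []
for start, end in chunk_spans:
    mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)  # tokens overlapping the chunk span
    chunk_embeddings.append(token_embeddings[mask].mean(dim=0))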