Tom Aarsen committed
Commit e30d07f · 1 parent: 08c012b

Add MRL results and usage

Files changed (1): README.md (+31 −1)
README.md CHANGED
@@ -997,7 +997,11 @@ model-index:
 
  # Static Embeddings with BERT uncased tokenizer finetuned on various datasets
 
- This is a [sentence-transformers](https://www.SBERT.net) model trained on the [gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq), [msmarco](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1), [squad](https://huggingface.co/datasets/sentence-transformers/squad), [s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc), [allnli](https://huggingface.co/datasets/sentence-transformers/all-nli), [paq](https://huggingface.co/datasets/sentence-transformers/paq), [trivia_qa](https://huggingface.co/datasets/sentence-transformers/trivia-qa), [msmarco_10m](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets), [swim_ir](https://huggingface.co/datasets/nthakur/swim-ir-monolingual), [pubmedqa](https://huggingface.co/datasets/sentence-transformers/pubmedqa), [miracl](https://huggingface.co/datasets/sentence-transformers/miracl), [mldr](https://huggingface.co/datasets/sentence-transformers/mldr) and [mr_tydi](https://huggingface.co/datasets/sentence-transformers/mr-tydi) datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 
  ## Model Details
@@ -1072,6 +1076,21 @@ print(similarities.shape)
  # [3, 3]
  ```
 
  <!--
  ### Direct Usage (Transformers)
@@ -1146,6 +1165,17 @@ You can finetune this model on your own dataset.
  | cosine_mrr@10 | 0.5482 |
  | cosine_map@100 | 0.4203 |
 
  <!--
  ## Bias, Risks and Limitations
 
 
  # Static Embeddings with BERT uncased tokenizer finetuned on various datasets
 
+ This is a [sentence-transformers](https://www.SBERT.net) model trained on the [gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq), [msmarco](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1), [squad](https://huggingface.co/datasets/sentence-transformers/squad), [s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc), [allnli](https://huggingface.co/datasets/sentence-transformers/all-nli), [paq](https://huggingface.co/datasets/sentence-transformers/paq), [trivia_qa](https://huggingface.co/datasets/sentence-transformers/trivia-qa), [msmarco_10m](https://huggingface.co/datasets/bclavie/msmarco-10m-triplets), [swim_ir](https://huggingface.co/datasets/nthakur/swim-ir-monolingual), [pubmedqa](https://huggingface.co/datasets/sentence-transformers/pubmedqa), [miracl](https://huggingface.co/datasets/sentence-transformers/miracl), [mldr](https://huggingface.co/datasets/sentence-transformers/mldr) and [mr_tydi](https://huggingface.co/datasets/sentence-transformers/mr-tydi) datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and is designed to be used for semantic search.
+
+ This model was trained with a [Matryoshka loss](https://huggingface.co/blog/matryoshka), allowing you to truncate the embeddings for faster retrieval at minimal performance cost (see [Matryoshka Evaluations](#matryoshka-evaluations)).
+
 
  ## Model Details
 
  # [3, 3]
  ```
 
+ This model was trained with a Matryoshka loss, so it can also be used at lower dimensionalities with minimal performance loss (see [Matryoshka Evaluations](#matryoshka-evaluations)).
+ Notably, a lower dimensionality makes information retrieval much faster and cheaper. You can specify a lower dimensionality with the `truncate_dim` argument when initializing the Sentence Transformer model:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("tomaarsen/static-retrieval-mrl-en-v1", truncate_dim=256)
+ embeddings = model.encode([
+     "what is the difference between chronological order and spatial order?",
+     "can lavender grow indoors?",
+ ])
+ print(embeddings.shape)
+ # => (2, 256)
+ ```
+
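The `truncate_dim` behavior in the snippet above can also be reproduced post-hoc: Matryoshka truncation amounts to keeping the leading dimensions of the full embeddings and re-normalizing each row. A minimal NumPy sketch, where the random vectors merely stand in for real `model.encode` output and the helper name is illustrative:

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize rows to unit length."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Stand-ins for full 1024-dimensional embeddings from model.encode:
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024))

small = truncate_and_normalize(full, 256)
print(small.shape)                 # (2, 256)
print(float(small[0] @ small[1]))  # cosine similarity of the two truncated, unit-norm vectors
```

Because the truncated rows are re-normalized, the dot product directly gives cosine similarity, just as with the full-dimensional embeddings.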
  <!--
  ### Direct Usage (Transformers)
 
 
  | cosine_mrr@10 | 0.5482 |
  | cosine_map@100 | 0.4203 |
 
+ ##### Matryoshka Evaluations
+
+ | Dimensionality | NanoBEIR_mean | NanoArguAna | NanoClimateFEVER | NanoDBPedia | NanoFEVER | NanoFiQA2018 | NanoHotpotQA | NanoMSMARCO | NanoNFCorpus | NanoNQ | NanoQuoraRetrieval | NanoSCIDOCS | NanoSciFact | NanoTouche2020 |
+ |----------------|---------------|-------------|------------------|-------------|-----------|--------------|--------------|-------------|--------------|--------|--------------------|-------------|-------------|----------------|
+ | 1024 | **0.5031** | 0.4077 | 0.3308 | 0.5681 | 0.6921 | 0.3651 | 0.6547 | 0.4040 | 0.3241 | 0.4533 | 0.8950 | 0.2642 | 0.6111 | 0.5702 |
+ | 512 | **0.4957** | 0.3878 | 0.3360 | 0.5626 | 0.6945 | 0.3517 | 0.6280 | 0.3892 | 0.3206 | 0.4505 | 0.8986 | 0.2657 | 0.5953 | 0.5635 |
+ | 256 | **0.4819** | 0.3855 | 0.3203 | 0.5407 | 0.6734 | 0.3518 | 0.6027 | 0.4144 | 0.2860 | 0.4254 | 0.8948 | 0.2466 | 0.5620 | 0.5605 |
+ | 128 | **0.4622** | 0.4001 | 0.2982 | 0.5266 | 0.6273 | 0.3188 | 0.5606 | 0.4025 | 0.2693 | 0.4021 | 0.8930 | 0.2283 | 0.5447 | 0.5368 |
+ | 64 | **0.4176** | 0.3424 | 0.2809 | 0.5022 | 0.5480 | 0.2831 | 0.4680 | 0.3739 | 0.2153 | 0.3845 | 0.8525 | 0.1680 | 0.5045 | 0.5050 |
+ | 32 | **0.3532** | 0.2866 | 0.1870 | 0.4292 | 0.4193 | 0.2292 | 0.3602 | 0.3587 | 0.1444 | 0.3525 | 0.8325 | 0.1525 | 0.3983 | 0.4408 |
+
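One quick way to read the added table is as relative retention of NanoBEIR_mean versus the full 1024 dimensions. A small sketch over the values copied from the table:

```python
# NanoBEIR_mean scores per dimensionality, copied from the Matryoshka Evaluations table.
nanobeir_mean = {1024: 0.5031, 512: 0.4957, 256: 0.4819, 128: 0.4622, 64: 0.4176, 32: 0.3532}

full = nanobeir_mean[1024]
for dim, score in nanobeir_mean.items():
    # e.g. 256 dims retain about 95.8% of full NanoBEIR_mean at a quarter of the storage.
    print(f"{dim:>4} dims: {score / full:.1%} of full NanoBEIR_mean")
```

Even at 64 dimensions (16x smaller vectors) the model keeps roughly 83% of its full-dimensional NanoBEIR_mean, which is the trade-off the Matryoshka training is designed to enable.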
  <!--
  ## Bias, Risks and Limitations