Jack Morris committed
Commit 873605f · 1 Parent(s): 4ad5fdb
add -v2 to README
README.md CHANGED
@@ -8661,7 +8661,7 @@ Our new model that naturally integrates "context tokens" into the embedding proc
<br>
<hr>

- # How to use `cde-small-
+ # How to use `cde-small-v2`

Our embedding model needs to be used in *two stages*. The first stage is to gather some dataset information by embedding a subset of the corpus using our "first-stage" model. The second stage is to actually embed queries and documents, conditioning on the corpus information from the first stage. Note that we can do the first-stage part offline and only use the second-stage weights at inference time.
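To make the two-stage flow concrete, here is a minimal sketch using the Sentence Transformers integration described later in this README. The `prompt_name` and `dataset_embeddings` arguments are assumed to behave as in that later snippet, and the toy corpus and queries are invented for illustration; in practice the first stage should embed a sample of `model[0].config.transductive_corpus_size` documents from your real corpus.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)

# Toy data for illustration only; the model expects
# model[0].config.transductive_corpus_size (512) context documents in practice.
corpus = ["Paris is the capital of France.", "The Eiffel Tower is in Paris."]
queries = ["What is the capital of France?"]

# Stage 1: embed a sample of the corpus to obtain the "dataset embeddings".
dataset_embeddings = model.encode(corpus, prompt_name="document", convert_to_tensor=True)

# Stage 2: embed documents and queries, conditioning on the stage-1 output.
doc_embeddings = model.encode(
    corpus, prompt_name="document", dataset_embeddings=dataset_embeddings, convert_to_tensor=True
)
query_embeddings = model.encode(
    queries, prompt_name="query", dataset_embeddings=dataset_embeddings, convert_to_tensor=True
)

# Embeddings are normalized, so dot product acts as cosine similarity.
scores = query_embeddings @ doc_embeddings.T
```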
@@ -8670,7 +8670,7 @@ Our embedding model needs to be used in *two stages*. The first stage is to gath
## With Transformers

<details>
- <summary>Click to learn how to use cde-small-
+ <summary>Click to learn how to use cde-small-v2 with Transformers</summary>

### Loading the model
@@ -8678,7 +8678,7 @@ Our model can be loaded using `transformers` out-of-the-box with "trust remote c
```python
import transformers

- model = transformers.AutoModel.from_pretrained("jxm/cde-small-
+ model = transformers.AutoModel.from_pretrained("jxm/cde-small-v2", trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
```
@@ -8767,13 +8767,13 @@ these embeddings can be compared using dot product, since they're normalized.

### What if I don't know what my corpus will be ahead of time?

- If you can't obtain corpus information ahead of time, you still have to pass *something* as the dataset embeddings; our model will work fine in this case, but not quite as well; without corpus information, our model performance drops from 65.0 to 63.8 on MTEB. We provide [some random strings](https://huggingface.co/jxm/cde-small-
+ If you can't obtain corpus information ahead of time, you still have to pass *something* as the dataset embeddings. Our model will still work in this case, just not quite as well: without corpus information, performance drops from 65.0 to 63.8 on MTEB. We provide [some random strings](https://huggingface.co/jxm/cde-small-v2/resolve/main/random_strings.txt) that worked well for us and can be used as a substitute for corpus sampling.


## With Sentence Transformers

<details open="">
- <summary>Click to learn how to use cde-small-
+ <summary>Click to learn how to use cde-small-v2 with Sentence Transformers</summary>

### Loading the model
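As a hedged sketch of how those fallback strings could be fetched and plugged in, assuming only that `random_strings.txt` is a plain newline-separated text file as the link above suggests:

```python
from huggingface_hub import hf_hub_download

# Download the fallback "random strings" file from the model repo.
path = hf_hub_download(repo_id="jxm/cde-small-v2", filename="random_strings.txt")

with open(path, encoding="utf-8") as f:
    random_strings = [line.strip() for line in f if line.strip()]

# These strings stand in for a corpus sample: embed them in the first stage to
# produce the dataset embeddings that the second stage conditions on.
```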
@@ -8781,7 +8781,7 @@ Our model can be loaded using `sentence-transformers` out-of-the-box with "trust
```python
from sentence_transformers import SentenceTransformer

- model = SentenceTransformer("jxm/cde-small-
+ model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)
```

#### Note on prefixes
@@ -8838,7 +8838,7 @@ from sentence_transformers import SentenceTransformer
from datasets import load_dataset

# 1. Load the Sentence Transformer model
- model = SentenceTransformer("jxm/cde-small-
+ model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)
context_docs_size = model[0].config.transductive_corpus_size  # 512

# 2. Load the dataset: context dataset, docs, and queries
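The hunk above ends just before the dataset is loaded. As a hedged illustration of how step 2 might continue, here is one way to draw exactly `context_docs_size` documents to use as the context set; the dataset name and column names are placeholders chosen for this sketch, not taken from the card:

```python
import random

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# 1. Load the Sentence Transformer model (as in the hunk above)
model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)
context_docs_size = model[0].config.transductive_corpus_size  # 512

# 2. Load a corpus; this dataset/column pair is only a placeholder example.
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
docs = dataset["answer"]
queries = dataset["query"]

# Sample exactly `context_docs_size` documents to serve as the context set
# that the first stage embeds.
context_docs = random.sample(docs, k=context_docs_size)
```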