Jack Morris committed
Commit 873605f · 1 parent: 4ad5fdb

add -v2 to README

Files changed (1): README.md (+7 -7)
@@ -8661,7 +8661,7 @@ Our new model that naturally integrates "context tokens" into the embedding proc
  <br>
  <hr>

- # How to use `cde-small-v1`
+ # How to use `cde-small-v2`

  Our embedding model needs to be used in *two stages*. The first stage is to gather some dataset information by embedding a subset of the corpus using our "first-stage" model. The second stage is to actually embed queries and documents, conditioning on the corpus information from the first stage. Note that we can do the first stage part offline and only use the second-stage weights at inference time.
@@ -8670,7 +8670,7 @@ Our embedding model needs to be used in *two stages*. The first stage is to gath
  ## With Transformers

  <details>
- <summary>Click to learn how to use cde-small-v1 with Transformers</summary>
+ <summary>Click to learn how to use cde-small-v2 with Transformers</summary>

  ### Loading the model
@@ -8678,7 +8678,7 @@ Our model can be loaded using `transformers` out-of-the-box with "trust remote c
  ```python
  import transformers

- model = transformers.AutoModel.from_pretrained("jxm/cde-small-v1", trust_remote_code=True)
+ model = transformers.AutoModel.from_pretrained("jxm/cde-small-v2", trust_remote_code=True)
  tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
  ```
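For reference, the two-stage flow this section describes looks roughly like the sketch below once the model is loaded. This is a minimal sketch, not the README's own code: the `first_stage_model` and `second_stage_model` attributes and the `dataset_embeddings` keyword are assumptions about the remote-code interface that this diff does not show.

```python
import torch
import transformers

model = transformers.AutoModel.from_pretrained("jxm/cde-small-v2", trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")

# First stage: embed a small sample of the corpus to get "dataset embeddings".
# `first_stage_model` is an assumed attribute name, not confirmed by this diff.
corpus_sample = ["a first example document", "a second example document"]
tokenized = tokenizer(corpus_sample, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    dataset_embeddings = model.first_stage_model(**tokenized)

# Second stage: embed the real documents (or queries), conditioned on the
# dataset embeddings from the first stage. `second_stage_model` and the
# `dataset_embeddings` keyword are likewise assumptions.
docs = tokenizer(["a document to embed"], truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    doc_embeddings = model.second_stage_model(
        input_ids=docs["input_ids"],
        attention_mask=docs["attention_mask"],
        dataset_embeddings=dataset_embeddings,
    )
# Normalize so a plain dot product works as the similarity score.
doc_embeddings /= doc_embeddings.norm(p=2, dim=1, keepdim=True)
```

Because the first stage depends only on the corpus, its output can be computed offline and reused for every query, which is what the paragraph above means by using only the second-stage weights at inference time.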
@@ -8767,13 +8767,13 @@ these embeddings can be compared using dot product, since they're normalized.
  ### What if I don't know what my corpus will be ahead of time?

- If you can't obtain corpus information ahead of time, you still have to pass *something* as the dataset embeddings; our model will work fine in this case, but not quite as well; without corpus information, our model performance drops from 65.0 to 63.8 on MTEB. We provide [some random strings](https://huggingface.co/jxm/cde-small-v1/resolve/main/random_strings.txt) that worked well for us that can be used as a substitute for corpus sampling.
+ If you can't obtain corpus information ahead of time, you still have to pass *something* as the dataset embeddings; our model will work fine in this case, but not quite as well; without corpus information, our model performance drops from 65.0 to 63.8 on MTEB. We provide [some random strings](https://huggingface.co/jxm/cde-small-v2/resolve/main/random_strings.txt) that worked well for us that can be used as a substitute for corpus sampling.

  ## With Sentence Transformers

  <details open="">
- <summary>Click to learn how to use cde-small-v1 with Sentence Transformers</summary>
+ <summary>Click to learn how to use cde-small-v2 with Sentence Transformers</summary>

  ### Loading the model
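For the fallback described in this hunk, the linked `random_strings.txt` can stand in for a real corpus sample. A sketch under the same assumptions as above (`first_stage_model` is an assumed attribute; taking 512 strings matches the transductive corpus size quoted later in the README):

```python
import urllib.request

import torch
import transformers

model = transformers.AutoModel.from_pretrained("jxm/cde-small-v2", trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")

# Download the random strings provided as a substitute for corpus sampling.
url = "https://huggingface.co/jxm/cde-small-v2/resolve/main/random_strings.txt"
random_strings = urllib.request.urlopen(url).read().decode("utf-8").splitlines()

# Run the first stage on the random strings instead of real corpus documents.
tokenized = tokenizer(random_strings[:512], truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    dataset_embeddings = model.first_stage_model(**tokenized)  # assumed attribute, as above
```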
@@ -8781,7 +8781,7 @@ Our model can be loaded using `sentence-transformers` out-of-the-box with "trust
  ```python
  from sentence_transformers import SentenceTransformer

- model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)
+ model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)
  ```

  #### Note on prefixes
@@ -8838,7 +8838,7 @@ from sentence_transformers import SentenceTransformer
  from datasets import load_dataset

  # 1. Load the Sentence Transformer model
- model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)
+ model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)
  context_docs_size = model[0].config.transductive_corpus_size # 512

  # 2. Load the dataset: context dataset, docs, and queries
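The snippet in this hunk stops at step 2; a rough continuation is sketched below. The `prompt_name` values and the `dataset_embeddings=` keyword to `encode()` are assumptions about the custom Sentence Transformers integration (see the "Note on prefixes" above and the full README for the exact interface):

```python
import random

from sentence_transformers import SentenceTransformer

# 1. Load the Sentence Transformer model (as in the snippet above)
model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)
context_docs_size = model[0].config.transductive_corpus_size  # 512

# 2. Toy corpus and queries; in practice these come from your own dataset.
docs = [f"this is document number {i}" for i in range(1000)]
queries = ["which document is number seven?"]

# 3. First stage: sample `context_docs_size` documents and embed them as context.
context_docs = random.sample(docs, k=context_docs_size)
dataset_embeddings = model.encode(context_docs, prompt_name="document", convert_to_tensor=True)

# 4. Second stage: embed queries and documents, conditioned on the context.
# The `dataset_embeddings` keyword and the prompt names are assumptions here.
doc_embeddings = model.encode(docs, dataset_embeddings=dataset_embeddings, prompt_name="document", convert_to_tensor=True)
query_embeddings = model.encode(queries, dataset_embeddings=dataset_embeddings, prompt_name="query", convert_to_tensor=True)

similarities = model.similarity(query_embeddings, doc_embeddings)  # dot-product-style scores
```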
 