akiFQCint committed
Commit edafeb4 · 1 Parent(s): a80319c

fix readme

Files changed (2)
  1. README.md +12 -1
  2. README_JA.md +1 -1
README.md CHANGED
@@ -27,7 +27,7 @@ datasets:
  **[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**

  "Sarashina-Embedding-v1-1B" is a Japanese text embedding model, based on the 1.2B-parameter Japansese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
- We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score in the average of 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)(Japanese Massive Text Embedding Benchmark).
+ We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score in the average of 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

  This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

@@ -120,6 +120,17 @@ To achieve generic text embedding performance across a wide range of domains, we

  To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.

+ #### Dataset
+
+ |dataset|counts|
+ |:-:|:-:|
+ |JSNLI|141,388 |
+ |NU-MNLI|67,987|
+ |Mr. TyDi (only Japanese subset)| 3,697 |
+ |Natural Question (sampled)| 20,000|
+ |||
+ |**total**|**233,072**|
+
  # Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

  Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
README_JA.md CHANGED
@@ -24,7 +24,7 @@ datasets:

  「Sarashina-embedding-v1-1b」は、1.2Bパラメータの日本語LLM「[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)」をベースにした日本語テキスト埋め込みモデルです。

- このモデルは、マルチステージの対照学習で訓練し、 [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)(Japanese Massive Text Embedding Benchmark)の16個のデータセットの平均で、(2024/12/1時点で)最高水準の平均スコアを達成しました。
+ このモデルは、マルチステージの対照学習で訓練し、 [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark)の16個のデータセットの平均で、(2024/12/1時点で)最高水準の平均スコアを達成しました。

  このモデルは、文や段落を1792次元の高密度ベクトル空間にマッピングし、意味的テキスト類似度、意味的検索、paraphrase mining、テキスト分類、クラスタリングなどに使用できます。
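
Review note: the "Dataset" table added to README.md states a total of 233,072 examples. The sketch below is one way to check that figure against the four row counts. The variable names are illustrative and not part of the repository, and the per-dataset shares are derived here rather than stated in the README.

```python
# Row counts copied from the "Dataset" table added to README.md in this commit.
# The dictionary and variable names are hypothetical, chosen for this check only.
counts = {
    "JSNLI": 141_388,
    "NU-MNLI": 67_987,
    "Mr. TyDi (only Japanese subset)": 3_697,
    "Natural Question (sampled)": 20_000,
}
stated_total = 233_072  # the bold "total" row of the table

# Sum the rows and compare with the stated total.
computed_total = sum(counts.values())
print("totals match:", computed_total == stated_total)

# Derived (not from the README): each dataset's share of the fine-tuning mix.
shares = {name: n / computed_total for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

If the first line prints `totals match: True`, the table is internally consistent.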