yyqoni committed
Commit 2660234 · verified · 1 Parent(s): cac634a

Update README.md

Files changed (1)
  1. README.md +11 -2
README.md CHANGED
@@ -5,20 +5,29 @@ datasets:
 - hendrydong/preference_700K
 base_model:
 - microsoft/Phi-3-mini-4k-instruct
+pipeline_tag: text-classification
 ---


 # phi-instruct-segment Model Card

+- **Paper:** [Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
+](https://arxiv.org/abs/2501.02790)
+
+- **Model:** [yyqoni/Phi-3-mini-4k-instruct-segment-rm-700k](https://huggingface.co/yyqoni/Phi-3-mini-4k-instruct-segment-rm-700k)
+
 ## Method


 The segment reward model assigns rewards to semantically meaningful text segments, delimited dynamically by an entropy-based threshold. It is trained on binary preference labels from human feedback, optimizing a Bradley-Terry loss that aggregates segment rewards with the average function.

+## Architecture
 <div align=center>
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/GnDEETLQeFpqx7-enIENw.png)
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/xeGwtrpnx2bWFg5ZOHA7R.png)
+
 </div>
----
+

 ## Training

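The new `pipeline_tag: text-classification` front-matter line tells the Hub to surface this reward model as a text scorer. As a quick orientation, a minimal usage sketch in Python follows; that the checkpoint loads under `AutoModelForSequenceClassification` and emits a scalar reward logit are assumptions on my part, not something this commit confirms.

```python
# Hypothetical usage sketch motivated by `pipeline_tag: text-classification`.
# ASSUMPTION: the checkpoint exposes a sequence-classification (reward) head
# under this Auto class; the card itself shows no loading code.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "yyqoni/Phi-3-mini-4k-instruct-segment-rm-700k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Score a prompt-response pair; a higher logit means a more preferred response.
text = "Q: What is RLHF?\nA: Reinforcement learning from human feedback."
inputs = tokenizer(text, return_tensors="pt")
print(model(**inputs).logits)
```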
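The Method paragraph in the diff compresses the training recipe into one sentence, so here is a minimal sketch of the objective it describes: responses are cut into segments wherever next-token entropy crosses a threshold, segment rewards are averaged into a sequence reward, and chosen/rejected pairs are compared with a Bradley-Terry loss. Every name, shape, and the threshold value below are illustrative assumptions, not this repository's actual training code.

```python
# Illustrative sketch of the segment-level Bradley-Terry objective described
# in the Method section. Names, shapes, and the entropy threshold are
# assumptions for demonstration, not this repository's training code.
import torch
import torch.nn.functional as F


def segment_spans(token_entropies: torch.Tensor, threshold: float = 2.0):
    """Cut a response into (start, end) spans, ending a segment at every
    token whose predictive entropy exceeds the threshold (a stand-in for
    the card's 'entropy-based threshold' segmentation)."""
    spans, start = [], 0
    for i, h in enumerate(token_entropies.tolist()):
        if h > threshold:
            spans.append((start, i + 1))
            start = i + 1
    if start < token_entropies.numel():  # trailing low-entropy segment
        spans.append((start, token_entropies.numel()))
    return spans


def sequence_reward(token_rewards: torch.Tensor, spans):
    """Aggregate with the average function: read each segment's reward at
    its last token (a common convention, assumed here) and take the mean."""
    seg = torch.stack([token_rewards[end - 1] for _, end in spans])
    return seg.mean()


def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor):
    """Binary-preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected)


# Toy pair: random per-token rewards and entropies stand in for the reward
# head outputs and policy entropies of the Phi-3 backbone.
torch.manual_seed(0)
ent_chosen, ent_rejected = torch.rand(12) * 4, torch.rand(9) * 4
rew_chosen, rew_rejected = torch.randn(12), torch.randn(9)

loss = bradley_terry_loss(
    sequence_reward(rew_chosen, segment_spans(ent_chosen)),
    sequence_reward(rew_rejected, segment_spans(ent_rejected)),
)
print(loss)  # scalar; minimized when chosen outscores rejected
```

One property worth noting about the average aggregation: it keeps the sequence reward on the same scale regardless of how many segments the entropy threshold happens to produce.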