Update README.md

---
datasets:
- hendrydong/preference_700K
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-classification
---

# phi-instruct-segment Model Card

- **Paper:** [Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model](https://arxiv.org/abs/2501.02790)
- **Model:** [yyqoni/Phi-3-mini-4k-instruct-segment-rm-700k](https://huggingface.co/yyqoni/Phi-3-mini-4k-instruct-segment-rm-700k)
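
A minimal loading sketch, with assumptions ours: that the checkpoint works through the standard `transformers` sequence-classification API implied by the `text-classification` pipeline tag, with a single reward logit, and that the Phi-3-instruct chat template was kept. The segment-level scoring described below may require the authors' own code.

```python
# Hedged usage sketch, not an official example: score one (prompt, response)
# pair with the reward model via a sequence-classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "yyqoni/Phi-3-mini-4k-instruct-segment-rm-700k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
model.eval()

# Example conversation (contents are placeholders, not from the model card).
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")

with torch.no_grad():
    # Assumes num_labels == 1: the lone logit is the scalar response reward
    # (per the card, an average over segment rewards).
    reward = model(input_ids).logits[0].item()
print(f"reward: {reward:.4f}")
```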

## Method

The segment reward model assigns rewards to semantically meaningful text segments, delimited dynamically by an entropy-based threshold. It is trained on binary preference labels from human feedback, optimizing a Bradley-Terry loss that aggregates a response's segment rewards by averaging them.
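
In notation of our own (the card states this only in prose): if a response $y$ to a prompt $x$ is split into segments $s_1, \dots, s_k$ with segment rewards $r_\theta(x, s_i)$, then

$$
R_\theta(x, y) = \frac{1}{k} \sum_{i=1}^{k} r_\theta(x, s_i), \qquad
\mathcal{L}(\theta) = -\,\mathbb{E}\big[\log \sigma\big(R_\theta(x, y_w) - R_\theta(x, y_l)\big)\big],
$$

where $(x, y_w, y_l)$ is a preference pair with $y_w$ preferred over $y_l$, and $\sigma$ is the logistic function.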

## Architecture

<div align=center>

![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/xeGwtrpnx2bWFg5ZOHA7R.png)

</div>

## Training