|
--- |
|
library_name: transformers |
|
license: mit |
|
datasets: |
|
- argilla/ultrafeedback-binarized-preferences-cleaned |
|
base_model: |
|
- microsoft/Phi-3-mini-4k-instruct |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# phi-instruct-segment-ppo Model Card |
|
|
|
The *phi-instruct-segment-ppo* model uses a segment-level reward model to improve reinforcement learning from human feedback (RLHF) in language models. It builds on the method described in our paper *[Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model](https://arxiv.org/abs/2501.02790)*.
|
|
|
--- |
|
|
|
## Method Illustration |
|
|
|
Below is an illustration of the segment-based reward modeling method, showing how entropy thresholds are used to segment generated text and how the segment-level reward model is integrated with PPO training:
|
|
|
<div align="center">
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/xeGwtrpnx2bWFg5ZOHA7R.png) |
|
|
|
</div> |
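
The snippet below is a minimal, assumption-laden sketch of the entropy-threshold idea shown in the figure, not the paper's implementation: a new segment is opened whenever the policy's per-token predictive entropy exceeds a fixed threshold (the threshold value and the exact cutting rule here are placeholders).

```python
def segment_by_entropy(token_ids, token_entropies, threshold=2.0):
    """Cut a generated token sequence into segments, starting a new segment
    whenever the per-token predictive entropy exceeds `threshold`.

    Illustrative only: the threshold value and this exact cutting rule are
    assumptions, not the paper's configuration.
    """
    segments, current = [], []
    for tok, ent in zip(token_ids, token_entropies):
        if current and ent > threshold:
            segments.append(current)  # a high-entropy token starts a new segment
            current = []
        current.append(tok)
    if current:
        segments.append(current)
    return segments


# Toy example; in practice the entropies would come from the policy's logits,
# e.g. torch.distributions.Categorical(logits=logits).entropy().
toy_tokens = [101, 2023, 2003, 1037, 7099, 102]
toy_entropies = [3.1, 0.4, 0.2, 2.8, 0.5, 0.1]
print(segment_by_entropy(toy_tokens, toy_entropies))
# -> [[101, 2023, 2003], [1037, 7099, 102]]
```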
|
|
|
|
|
|
|
--- |
|
|
|
## Model Overview |
|
|
|
This approach redefines the granularity of RLHF training by: |
|
|
|
- Assigning rewards to semantically complete text segments, defined based on entropy thresholds. |
|
- Introducing techniques to stabilize RLHF training under dense, segment-level rewards. |
|
|
|
Model checkpoints are available on [HuggingFace](https://huggingface.co/collections/yyqoni/denserewardrlhf-ppo-677d39b5521f1e366c196f14). |
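
As a rough illustration of the second point above (an assumption-laden sketch, not the paper's exact training code), one common way to feed segment-level rewards into a token-level PPO update is to place each segment's reward on that segment's last token and normalize the resulting per-token rewards for stability:

```python
import torch


def token_rewards_from_segments(segment_rewards, segment_lengths, normalize=True, eps=1e-8):
    """Spread segment-level rewards into a per-token reward tensor for PPO.

    Each segment's scalar reward is placed on that segment's last token; other
    tokens receive zero. The zero-mean/unit-variance normalization is one simple
    stabilization heuristic, assumed here for illustration only.
    """
    total_len = sum(segment_lengths)
    rewards = torch.zeros(total_len)
    pos = 0
    for r, length in zip(segment_rewards, segment_lengths):
        pos += length
        rewards[pos - 1] = r  # reward lands on the segment's final token
    if normalize:
        rewards = (rewards - rewards.mean()) / (rewards.std() + eps)
    return rewards


# Toy example: three segments of lengths 4, 3, and 5 scored by a segment-level
# reward model.
print(token_rewards_from_segments([0.2, -0.1, 0.7], [4, 3, 5]))
```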
|
|
|
--- |
|
|
|
## Training Data |
|
|
|
We use the following datasets in our training pipeline:
|
|
|
- **Preference-700K Dataset**: A diverse collection of open-source preference datasets, including HH-RLHF, Stanford Human Preferences Dataset (SHP), and HelpSteer. |
|
- **UltraFeedback Dataset**: Used to sample prompts during the PPO training stage (see the loading sketch below).
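
As a rough illustration (the `"prompt"` column name below is an assumption about the dataset's schema, not something specified in this card), prompts can be pulled from the UltraFeedback dataset listed in the metadata with the `datasets` library:

```python
from datasets import load_dataset

# Load the prompt source listed in this card's metadata and take a few prompts
# for PPO rollouts. The "prompt" column name is assumed from the dataset schema.
ds = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")
prompts = ds.select(range(4))["prompt"]
print(prompts[0])
```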
|
|
|
--- |
|
|
|
## Base Model |
|
|
|
The *phi-instruct-segment-ppo* model is fine-tuned from **microsoft/Phi-3-mini-4k-instruct**. |
|
|
|
--- |
|
|
|
## Usage |
|
|
|
You can use this model directly with Hugging Face's Transformers library: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
# Load model and tokenizer |
|
model_name = "yyqoni/Phi-3-mini-4k-segment-ppo-60k" |
|
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True) |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
# Input text |
|
input_text = "What are the benefits of using reinforcement learning in AI?" |
|
|
|
# Apply chat template formatting with generation prompt |
|
formatted_input = tokenizer.apply_chat_template( |
|
[{"role": "user", "content": input_text}], |
|
tokenize=False, |
|
add_generation_prompt=True |
|
) |
|
|
|
# Tokenize the formatted input |
|
inputs = tokenizer(formatted_input, return_tensors="pt", add_special_tokens=False) |
|
|
|
# Generate response |
|
outputs = model.generate(**inputs, max_new_tokens=50) |
|
|
|
# Decode and print the response |
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
``` |
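
As a lighter-weight alternative, the same checkpoint can be run through the high-level `pipeline` API. The sampling settings below are illustrative defaults, not values recommended by the authors, and chat-formatted input requires a reasonably recent version of Transformers:

```python
from transformers import pipeline

# High-level alternative to the manual tokenize/generate steps above.
generator = pipeline(
    "text-generation",
    model="yyqoni/Phi-3-mini-4k-segment-ppo-60k",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What are the benefits of using reinforcement learning in AI?"}]
outputs = generator(messages, max_new_tokens=100, do_sample=True, temperature=0.7)

# With chat-formatted input the pipeline returns the full conversation,
# including the newly generated assistant turn.
print(outputs[0]["generated_text"][-1]["content"])
```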
|
|
|
--- |
|
|
|
## Citation |
|
|
|
If you find this model or our research useful, please consider citing our paper: |
|
|
|
```bibtex |
|
@misc{yin2025segmentingtextlearningrewards, |
|
title={Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model}, |
|
author={Yueqin Yin and Shentao Yang and Yujia Xie and Ziyi Yang and Yuting Sun and Hany Awadalla and Weizhu Chen and Mingyuan Zhou}, |
|
year={2025}, |
|
eprint={2501.02790}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2501.02790}, |
|
} |
|
``` |