---
library_name: transformers
license: mit
datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
---
# phi-instruct-segment-ppo Model Card
The *phi-instruct-segment-ppo* model is a PPO-trained policy guided by a segment-level reward model, which improves reinforcement learning from human feedback (RLHF) in language models. It implements the method described in our paper *[Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model](https://arxiv.org/abs/2501.02790)*.
---
## Method Illustration
Below is an illustration of the segment-based reward modeling method, showing how entropy thresholds define segment boundaries and how the segment-level reward model feeds into PPO training:
<div align="center">

![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/xeGwtrpnx2bWFg5ZOHA7R.png)

</div>
---
## Model Overview
This approach redefines the granularity of RLHF training by:
- Assigning rewards to semantically complete text segments whose boundaries are defined by entropy thresholds (see the sketch after this list).
- Introducing techniques to stabilize RLHF training under dense, segment-level rewards.
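To make the segmentation idea concrete, here is a minimal sketch of entropy-based text segmentation with the base model. It is not the released training code: the threshold value, the cut-before-high-entropy-token rule, and the helper name `segment_by_entropy` are illustrative assumptions; see the paper for the exact procedure.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def segment_by_entropy(text, model, tokenizer, threshold=2.0):
    """Split `text` into segments, opening a new segment whenever the model's
    predictive entropy crosses `threshold`. The threshold value and the
    'cut before a high-entropy token' rule are illustrative assumptions."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]                  # (seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)     # entropy per position

    ids = enc["input_ids"][0].tolist()
    segments, current = [], []
    for i, tok in enumerate(ids):
        # The distribution at position i-1 predicts token i,
        # so we cut right before tokens generated under high entropy.
        if i > 0 and entropy[i - 1] > threshold and current:
            segments.append(tokenizer.decode(current))
            current = []
        current.append(tok)
    if current:
        segments.append(tokenizer.decode(current))
    return segments

model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
print(segment_by_entropy("Reinforcement learning assigns credit over long action sequences.", model, tokenizer))
```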
Model checkpoints are available on [HuggingFace](https://huggingface.co/collections/yyqoni/denserewardrlhf-ppo-677d39b5521f1e366c196f14).
---
## Training Data
We utilize the following datasets in our training pipeline:
- **Preference-700K Dataset**: A diverse collection of open-source preference datasets, including HH-RLHF, the Stanford Human Preferences Dataset (SHP), and HelpSteer, used to train the segment-level reward model.
- **UltraFeedback Dataset**: Used for sampling prompts during the PPO training routine (see the loading example below).
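For a quick look at the PPO prompt source, the cleaned UltraFeedback preference data listed in this card's metadata can be loaded directly from the Hub. This is only an inspection example, not the training pipeline; the column names in the comment reflect that public dataset and may change:
```python
from datasets import load_dataset

# Preference data whose prompts are sampled for PPO rollouts
# (dataset listed in this model card's metadata).
ds = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")

print(ds.column_names)        # e.g. ['source', 'prompt', 'chosen', 'rejected', ...]
prompts = ds["prompt"][:4]    # take a handful of prompts for generation
```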
---
## Base Model
The *phi-instruct-segment-ppo* model is fine-tuned from **microsoft/Phi-3-mini-4k-instruct**.
---
## Usage
You can use this model directly with Hugging Face's Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_name = "yyqoni/Phi-3-mini-4k-segment-ppo-60k"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Input text
input_text = "What are the benefits of using reinforcement learning in AI?"
# Apply chat template formatting with generation prompt
formatted_input = tokenizer.apply_chat_template(
[{"role": "user", "content": input_text}],
tokenize=False,
add_generation_prompt=True
)
# Tokenize the formatted input
inputs = tokenizer(formatted_input, return_tensors="pt", add_special_tokens=False)
# Generate response
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode and print the response
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Citation
If you find this model or our research useful, please consider citing our paper:
```bibtex
@misc{yin2025segmentingtextlearningrewards,
  title={Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model},
  author={Yueqin Yin and Shentao Yang and Yujia Xie and Ziyi Yang and Yuting Sun and Hany Awadalla and Weizhu Chen and Mingyuan Zhou},
  year={2025},
  eprint={2501.02790},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.02790},
}
``` |