---
library_name: transformers
license: mit
datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
---
# phi-instruct-segment-ppo Model Card
The *phi-instruct-segment-ppo* model is a PPO-trained policy guided by a segment-level reward model, which improves reinforcement learning from human feedback (RLHF) in language models. It implements the method described in our paper *[Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model](https://arxiv.org/abs/2501.02790)*.
---
## Method Illustration
Below is an illustration of the segment-based reward modeling method, showing how entropy thresholds define segment boundaries and how the segment-level reward model feeds into PPO training:
<div align="center">

![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/xeGwtrpnx2bWFg5ZOHA7R.png)

</div>
---
## Model Overview
This approach redefines the granularity of RLHF training by:
- Assigning rewards to semantically complete text segments whose boundaries are defined by entropy thresholds (see the sketch after this list).
- Introducing techniques to stabilize RLHF training under dense, segment-level rewards.
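To make the segmentation idea concrete, here is a minimal sketch of entropy-based text segmentation with the base model. It is not the released training code: the threshold value, the cut-before-high-entropy-token rule, and the helper name `segment_by_entropy` are illustrative assumptions; see the paper for the exact procedure.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def segment_by_entropy(text, model, tokenizer, threshold=2.0):
    """Split `text` into segments, opening a new segment whenever the model's
    predictive entropy crosses `threshold`. The threshold value and the
    'cut before a high-entropy token' rule are illustrative assumptions."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]                  # (seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)     # entropy per position

    ids = enc["input_ids"][0].tolist()
    segments, current = [], []
    for i, tok in enumerate(ids):
        # The distribution at position i-1 predicts token i,
        # so we cut right before tokens generated under high entropy.
        if i > 0 and entropy[i - 1] > threshold and current:
            segments.append(tokenizer.decode(current))
            current = []
        current.append(tok)
    if current:
        segments.append(tokenizer.decode(current))
    return segments

model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
print(segment_by_entropy("Reinforcement learning assigns credit over long action sequences.", model, tokenizer))
```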
Model checkpoints are available on [HuggingFace](https://huggingface.co/collections/yyqoni/denserewardrlhf-ppo-677d39b5521f1e366c196f14).
---
## Training Data
We utilize the following datasets in our training pipeline:
- **Preference-700K Dataset**: A diverse collection of open-source preference datasets, including HH-RLHF, the Stanford Human Preferences Dataset (SHP), and HelpSteer, used to train the segment-level reward model.
- **UltraFeedback Dataset**: Used for sampling prompts during the PPO training routine (see the loading example below).
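For a quick look at the PPO prompt source, the cleaned UltraFeedback preference data listed in this card's metadata can be loaded directly from the Hub. This is only an inspection example, not the training pipeline; the column names in the comment reflect that public dataset and may change:
```python
from datasets import load_dataset

# Preference data whose prompts are sampled for PPO rollouts
# (dataset listed in this model card's metadata).
ds = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")

print(ds.column_names)        # e.g. ['source', 'prompt', 'chosen', 'rejected', ...]
prompts = ds["prompt"][:4]    # take a handful of prompts for generation
```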
---
## Base Model
The *phi-instruct-segment-ppo* model is fine-tuned from **microsoft/Phi-3-mini-4k-instruct**.
---
## Usage
You can use this model directly with Hugging Face's Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_name = "yyqoni/Phi-3-mini-4k-segment-ppo-60k"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Input text
input_text = "What are the benefits of using reinforcement learning in AI?"
# Apply chat template formatting with generation prompt
formatted_input = tokenizer.apply_chat_template(
[{"role": "user", "content": input_text}],
tokenize=False,
add_generation_prompt=True
)
# Tokenize the formatted input
inputs = tokenizer(formatted_input, return_tensors="pt", add_special_tokens=False)
# Generate response
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode and print the response
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Citation
If you find this model or our research useful, please consider citing our paper:
```bibtex
@misc{yin2025segmentingtextlearningrewards,
  title={Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model},
  author={Yueqin Yin and Shentao Yang and Yujia Xie and Ziyi Yang and Yuting Sun and Hany Awadalla and Weizhu Chen and Mingyuan Zhou},
  year={2025},
  eprint={2501.02790},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.02790},
}
``` |