---
library_name: transformers
license: mit
datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
---

# phi-instruct-segment-ppo Model Card

The *phi-instruct-segment-ppo* model introduces a segment-level reward model to improve reinforcement learning from human feedback (RLHF) in language models. This work builds upon the methods in our paper *[Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model](https://arxiv.org/abs/2501.02790)*.

---

## Method Illustration

Below is an illustration of the segment-based reward modeling method, showing how entropy thresholds are used to segment generated text and how the segment-level reward model is integrated with PPO training.

## Architecture
![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/xeGwtrpnx2bWFg5ZOHA7R.png)
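To make the segmentation idea concrete, the sketch below shows one way entropy-based segmentation could look in code: generation is run while keeping per-step scores, the entropy of the next-token distribution is computed at each generated position, and a new segment is started whenever that entropy exceeds a threshold. This is a minimal illustration, not the paper's exact procedure; the threshold value (`0.7`), the greedy-decoding setup, and the grouping rule are assumptions made for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: the entropy threshold and segmentation rule below
# are assumptions for demonstration, not the paper's implementation.
model_name = "yyqoni/Phi-3-mini-4k-segment-ppo-60k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain PPO in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

# Generate while keeping per-step logits so token-level entropies can be computed.
out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)
gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]

# Entropy of the next-token distribution at each generated position.
entropies = []
for step_logits in out.scores:
    probs = torch.softmax(step_logits[0], dim=-1)
    entropies.append(-(probs * torch.log(probs + 1e-12)).sum().item())

# Start a new segment whenever a token's entropy exceeds the threshold,
# i.e. high-uncertainty tokens mark segment boundaries.
THRESHOLD = 0.7  # assumed value for illustration
segments, current = [], []
for tok_id, ent in zip(gen_ids.tolist(), entropies):
    if current and ent > THRESHOLD:
        segments.append(tokenizer.decode(current))
        current = []
    current.append(tok_id)
if current:
    segments.append(tokenizer.decode(current))

print(segments)  # each element is one candidate segment
```

The threshold controls how fine-grained the segments are; in this model's training, such semantically complete segments are the units that receive rewards from the segment-level reward model during PPO.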
---

## Model Overview

This approach redefines the granularity of RLHF training by:

- Assigning rewards to semantically complete text segments, defined based on entropy thresholds.
- Introducing techniques to stabilize RLHF training under dense, segment-level rewards.

Model checkpoints are available on [Hugging Face](https://huggingface.co/collections/yyqoni/denserewardrlhf-ppo-677d39b5521f1e366c196f14).

---

## Training Data

We use the following datasets in our training pipeline:

- **Preference-700K Dataset**: A diverse collection of open-source preference datasets, including HH-RLHF, the Stanford Human Preferences Dataset (SHP), and HelpSteer.
- **UltraFeedback Dataset**: Used for sampling prompts during the PPO training routine.

---

## Base Model

The *phi-instruct-segment-ppo* model is fine-tuned from **microsoft/Phi-3-mini-4k-instruct**.

---

## Usage

You can use this model directly with Hugging Face's Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "yyqoni/Phi-3-mini-4k-segment-ppo-60k"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Input text
input_text = "What are the benefits of using reinforcement learning in AI?"

# Apply chat template formatting with generation prompt
formatted_input = tokenizer.apply_chat_template(
    [{"role": "user", "content": input_text}],
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the formatted input
inputs = tokenizer(formatted_input, return_tensors="pt", add_special_tokens=False)

# Generate response
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode and print the response
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Citation

If you find this model or our research useful, please consider citing our paper:

```bibtex
@misc{yin2025segmentingtextlearningrewards,
      title={Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model},
      author={Yueqin Yin and Shentao Yang and Yujia Xie and Ziyi Yang and Yuting Sun and Hany Awadalla and Weizhu Chen and Mingyuan Zhou},
      year={2025},
      eprint={2501.02790},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.02790},
}
```