---
library_name: transformers
license: mit
datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
---

# phi-instruct-segment-ppo Model Card

The *phi-instruct-segment-ppo* model introduces a segment-level reward model to improve reinforcement learning from human feedback (RLHF) in language models. This work builds upon the methods in our paper *[Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model](https://arxiv.org/abs/2501.02790)*.

---

## Method Illustration

Below is an illustration of the segment-based reward modeling method, showing how entropy thresholds are used to segment generated text and how the segment-level reward model is integrated with PPO training.

## Architecture
![image/png](https://cdn-uploads.huggingface.co/production/uploads/605e8dfd5abeb13e714c4c18/xeGwtrpnx2bWFg5ZOHA7R.png)
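To make the segmentation idea concrete, the sketch below shows one way entropy-based segmentation could look in code: generation is run while keeping per-step scores, the entropy of the next-token distribution is computed at each generated position, and a new segment is started whenever that entropy exceeds a threshold. This is a minimal illustration, not the paper's exact procedure; the threshold value (`0.7`), the greedy-decoding setup, and the grouping rule are assumptions made for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: the entropy threshold and segmentation rule below
# are assumptions for demonstration, not the paper's implementation.
model_name = "yyqoni/Phi-3-mini-4k-segment-ppo-60k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain PPO in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

# Generate while keeping per-step logits so token-level entropies can be computed.
out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)
gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]

# Entropy of the next-token distribution at each generated position.
entropies = []
for step_logits in out.scores:
    probs = torch.softmax(step_logits[0], dim=-1)
    entropies.append(-(probs * torch.log(probs + 1e-12)).sum().item())

# Start a new segment whenever a token's entropy exceeds the threshold,
# i.e. high-uncertainty tokens mark segment boundaries.
THRESHOLD = 0.7  # assumed value for illustration
segments, current = [], []
for tok_id, ent in zip(gen_ids.tolist(), entropies):
    if current and ent > THRESHOLD:
        segments.append(tokenizer.decode(current))
        current = []
    current.append(tok_id)
if current:
    segments.append(tokenizer.decode(current))

print(segments)  # each element is one candidate segment
```

The threshold controls how fine-grained the segments are; in this model's training, such semantically complete segments are the units that receive rewards from the segment-level reward model during PPO.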
---

## Model Overview

This approach redefines the granularity of RLHF training by:

- Assigning rewards to semantically complete text segments, defined based on entropy thresholds.
- Introducing techniques to stabilize RLHF training under dense, segment-level rewards.

Model checkpoints are available on [Hugging Face](https://huggingface.co/collections/yyqoni/denserewardrlhf-ppo-677d39b5521f1e366c196f14).

---

## Training Data

We use the following datasets in our training pipeline:

- **Preference-700K Dataset**: A diverse collection of open-source preference datasets, including HH-RLHF, the Stanford Human Preferences Dataset (SHP), and HelpSteer.
- **UltraFeedback Dataset**: Used for sampling prompts during the PPO training routine.

---

## Base Model

The *phi-instruct-segment-ppo* model is fine-tuned from **microsoft/Phi-3-mini-4k-instruct**.

---

## Usage

You can use this model directly with Hugging Face's Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "yyqoni/Phi-3-mini-4k-segment-ppo-60k"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Input text
input_text = "What are the benefits of using reinforcement learning in AI?"

# Apply chat template formatting with generation prompt
formatted_input = tokenizer.apply_chat_template(
    [{"role": "user", "content": input_text}],
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the formatted input
inputs = tokenizer(formatted_input, return_tensors="pt", add_special_tokens=False)

# Generate response
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode and print the response
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Citation

If you find this model or our research useful, please consider citing our paper:

```bibtex
@misc{yin2025segmentingtextlearningrewards,
      title={Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model},
      author={Yueqin Yin and Shentao Yang and Yujia Xie and Ziyi Yang and Yuting Sun and Hany Awadalla and Weizhu Chen and Mingyuan Zhou},
      year={2025},
      eprint={2501.02790},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.02790},
}
```