Update README.md
README.md
CHANGED
@@ -7,4 +7,4 @@ base_model:
 - RLHFlow/LLaMA3-SFT-v2
 ---
 
-This is the
+This is the bandit reward based ppo model introduced in the preprint **Segmenting Text and Learning Their Rewards for Improved RLHF in Language Models** (https://arxiv.org/abs/2501.02790). For more details, please visit our repository at https://github.com/yinyueqin/DenseRewardRLHF-PPO.