yyqoni/rlhflow-llama-3-sft-8b-v2-bandit-ppo-60k



yyqoni committed 11 days ago

Commit 4724aa2 (verified) · 1 Parent(s): 88f0fdb

Update README.md

Files changed (1)

README.md +1 -1


@@ -7,4 +7,4 @@ base_model:
 - RLHFlow/LLaMA3-SFT-v2
 ---
-This is the token-wise reward based ppo model introduced in the preprint **Segmenting Text and Learning Their Rewards for Improved RLHF in Language Models** (https://arxiv.org/abs/2501.02790). For more details, please visit our repository at https://github.com/yinyueqin/DenseRewardRLHF-PPO.
+This is the bandit reward based ppo model introduced in the preprint **Segmenting Text and Learning Their Rewards for Improved RLHF in Language Models** (https://arxiv.org/abs/2501.02790). For more details, please visit our repository at https://github.com/yinyueqin/DenseRewardRLHF-PPO.