Bootstrapping Language Models with DPO Implicit Rewards • arXiv:2406.09760 • Published Jun 14, 2024
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM • arXiv:2406.12168 • Published Jun 18, 2024
WPO: Enhancing RLHF with Weighted Preference Optimization • arXiv:2406.11827 • Published Jun 17, 2024
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs • arXiv:2406.18629 • Published Jun 26, 2024
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning • arXiv:2407.18248 • Published Jul 25, 2024
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge • arXiv:2407.19594 • Published Jul 28, 2024
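
The entries above all extend the DPO family of preference-optimization objectives. As a shared point of reference only, and not the method of any one paper listed, here is a minimal PyTorch sketch of the standard DPO loss; the function name `dpo_loss`, the `beta` value, and the toy log-probabilities are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from per-sequence log-probabilities.

    Each argument is a 1-D tensor of summed token log-probs for the
    chosen (preferred) or rejected completion, under either the policy
    being trained or the frozen reference model. `beta` scales the
    implicit reward margin.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-15.1]))
print(loss.item())
```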