Vigneshwaran's Collections

ORPO: Monolithic Preference Optimization without Reference Model
Paper • arXiv:2403.07691 • Published • 64 upvotes

sDPO: Don't Use Your Data All at Once
Paper • arXiv:2403.19270 • Published • 40 upvotes

Teaching Large Language Models to Reason with Reinforcement Learning
Paper • arXiv:2403.04642 • Published • 46 upvotes

Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • arXiv:2404.07503 • Published • 29 upvotes

Rho-1: Not All Tokens Are What You Need
Paper • arXiv:2404.07965 • Published • 88 upvotes

Learn Your Reference Model for Real Good Alignment
Paper • arXiv:2404.09656 • Published • 82 upvotes

Dataset Reset Policy Optimization for RLHF
Paper • arXiv:2404.08495 • Published • 9 upvotes

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks
Paper • arXiv:2404.14723 • Published • 10 upvotes

RLHF Workflow: From Reward Modeling to Online RLHF
Paper • arXiv:2405.07863 • Published • 66 upvotes

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Paper • arXiv:2405.11143 • Published • 35 upvotes

Mixtures of Experts Unlock Parameter Scaling for Deep RL
Paper • arXiv:2402.08609 • Published • 34 upvotes

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
Paper • arXiv:2406.02900 • Published • 11 upvotes

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Paper • arXiv:2402.14740 • Published • 12 upvotes

HelpSteer2: Open-source dataset for training top-performing reward models
Paper • arXiv:2406.08673 • Published • 16 upvotes

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Paper • arXiv:2406.09279 • Published • 2 upvotes

Understanding the performance gap between online and offline alignment algorithms
Paper • arXiv:2405.08448 • Published • 14 upvotes

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Paper • arXiv:2312.09390 • Published • 32 upvotes

Theoretical guarantees on the best-of-n alignment policy
Paper • arXiv:2401.01879 • Published

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
Paper • arXiv:2312.11456 • Published • 1 upvote

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Paper • arXiv:2304.06767 • Published • 2 upvotes

Self-Play Preference Optimization for Language Model Alignment
Paper • arXiv:2405.00675 • Published • 25 upvotes

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Paper • arXiv:2406.10216 • Published • 2 upvotes

Scaling Laws for Reward Model Overoptimization
Paper • arXiv:2210.10760 • Published

AgentInstruct: Toward Generative Teaching with Agentic Flows
Paper • arXiv:2407.03502 • Published • 51 upvotes

Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment
Paper • arXiv:2405.17931 • Published

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
Paper • arXiv:2405.00451 • Published

Foundations of Reinforcement Learning and Interactive Decision Making
Paper • arXiv:2312.16730 • Published

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Paper • arXiv:2408.07199 • Published • 21 upvotes

Disentangling Length from Quality in Direct Preference Optimization
Paper • arXiv:2403.19159 • Published

Imitating Language via Scalable Inverse Reinforcement Learning
Paper • arXiv:2409.01369 • Published

Contrastive Preference Learning: Learning from Human Feedback without RL
Paper • arXiv:2310.13639 • Published • 24 upvotes

D2PO: Discriminator-Guided DPO with Response Evaluation Models
Paper • arXiv:2405.01511 • Published

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Paper • arXiv:2408.06266 • Published • 10 upvotes

Training Language Models to Self-Correct via Reinforcement Learning
Paper • arXiv:2409.12917 • Published • 136 upvotes

The Perfect Blend: Redefining RLHF with Mixture of Judges
Paper • arXiv:2409.20370 • Published • 4 upvotes

HelpSteer2-Preference: Complementing Ratings with Preferences
Paper • arXiv:2410.01257 • Published • 21 upvotes

A Critical Evaluation of AI Feedback for Aligning Large Language Models
Paper • arXiv:2402.12366 • Published • 3 upvotes

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Paper • arXiv:2410.08146 • Published

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Paper • arXiv:2410.02089 • Published • 12 upvotes

SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF
Paper • arXiv:2411.01798 • Published • 8 upvotes

OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
Paper • arXiv:2412.16849 • Published • 7 upvotes