Offline Reinforcement Learning for LLM Multi-Step Reasoning • Paper • 2412.16145 • Published 18 days ago • 38
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment • Paper • 2405.19332 • Published May 29, 2024 • 15