Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often risk disrupting established response patterns and overfitting to expert data. To address this, we present a novel investigation into a unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of the influence of off-policy expert data at both holistic and granular levels, CHORD incorporates a dual-control mechanism. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert data while preserving on-policy exploration and mitigating disruption from off-policy data. We conduct extensive experiments across diverse practical tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
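The dual-control mechanism described above can be sketched roughly as follows. This is a minimal illustrative assumption, not the paper's exact formulation: the function names, the `p * (1 - p)` token-wise weighting, and the linear decay schedule for the global coefficient are all hypothetical choices introduced here only to make the two controls concrete.

```python
import math

def chord_loss(pg_loss, expert_token_probs, mu):
    """Blend an on-policy policy-gradient loss with a token-wise
    weighted SFT (negative log-likelihood) term on expert tokens.

    pg_loss            -- scalar on-policy RL loss for the batch
    expert_token_probs -- current policy's probability of each expert
                          token (hypothetical interface)
    mu                 -- global coefficient in [0, 1]; decaying it
                          toward 0 shifts training from off-policy
                          imitation to on-policy exploration
    """
    # Hypothetical token-wise weight p * (1 - p): it down-weights both
    # tokens the policy has already mastered (p close to 1) and tokens far
    # from the policy's distribution (p close to 0), softening disruptive
    # off-policy updates.
    terms = [p * (1.0 - p) * (-math.log(p)) for p in expert_token_probs]
    sft_loss = sum(terms) / len(terms)
    return (1.0 - mu) * pg_loss + mu * sft_loss

def mu_schedule(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Linearly decay the global coefficient over training -- one simple
    choice for guiding the imitation-to-exploration transition; the
    actual schedule used by CHORD may differ."""
    frac = min(step / max(total_steps, 1), 1.0)
    return mu_start + (mu_end - mu_start) * frac
```

With `mu = 0` the objective reduces to pure on-policy RL, and with `mu = 1` it reduces to the token-weighted SFT term, so the global coefficient interpolates between the two regimes while the token-wise weights act at the granular level.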