To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that uses a high-entropy prompt to bifurcate the policy into a normal mode and a high-entropy mode. The two modes share model parameters but undergo collaborative dual-mode entropy regularization tailored to distinct objectives: the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes on both general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, in which the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
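As a rough illustration of the idea (not the paper's actual objective, which the abstract does not specify), the following sketch shows one way a single set of parameters could serve two prompt-selected modes with mode-specific entropy regularization. The prompt wording, the coefficient values, the names `build_prompt` and `mode_loss`, and the choice of a plain policy-gradient term with an entropy bonus are all assumptions made for this example.

```python
import math

# Hypothetical prompt text; the abstract does not give the actual wording.
HIGH_ENTROPY_PROMPT = "Explore diverse and unusual solution paths."

# Hypothetical per-mode entropy-bonus weights (assumed values).
ENTROPY_COEF = {"normal": 0.0, "high_entropy": 0.01}


def build_prompt(task: str, mode: str) -> str:
    """Select the mode purely through the prompt; parameters stay shared."""
    if mode == "high_entropy":
        return HIGH_ENTROPY_PROMPT + "\n" + task
    return task


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mode_loss(logits, action, advantage, mode):
    """Policy-gradient loss for one token, minus a mode-specific entropy bonus.

    This is an assumed form: the normal mode is driven by the advantage
    (task correctness) alone, while the high-entropy mode is additionally
    rewarded for keeping its token distribution spread out.
    """
    probs = softmax(logits)
    pg_term = -advantage * math.log(probs[action])
    return pg_term - ENTROPY_COEF[mode] * entropy(probs)
```

Under these assumptions, the same logits yield a slightly lower loss in the high-entropy mode whenever the distribution has nonzero entropy, which is what pushes that mode toward exploration while the normal mode remains focused on correctness.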