Reinforced Efficient Reasoning via Semantically Diverse Exploration

Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained, segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address these challenges, we propose reinforced efficient reasoning via semantically diverse exploration (ROSE) for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and selects branching points with high semantic divergence, from which new reasoning paths are generated, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length-aware segment-level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Code is available at https://github.com/ZiqiZhao1/ROSE-rl.
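
The interaction between the two exploration mechanisms can be illustrated with a minimal sketch. It is not taken from the ROSE codebase: the names `semantic_entropy`, `choose_branch_point`, and the `ROOT` sentinel are hypothetical, and the sketch assumes that the continuations sampled at each candidate branching point have already been grouped into semantic clusters by some external procedure. Under those assumptions, semantic entropy is computed as the Shannon entropy of the empirical cluster distribution, the point with the highest entropy is chosen for branching, and with probability $\varepsilon$ a fresh rollout is started from the root instead.

```python
import math
import random
from collections import Counter

# Hypothetical sentinel meaning "start a fresh rollout from the root".
ROOT = -1


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over clusters.

    `cluster_ids` holds one semantic-cluster label per sampled continuation.
    How the clusters are obtained is outside the scope of this sketch.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def choose_branch_point(clusters_per_point, epsilon, rng=random):
    """Pick where the next rollout should start.

    With probability `epsilon`, return ROOT (assumed epsilon-exploration).
    Otherwise return the index of the candidate branching point whose
    continuations are most semantically divergent.
    """
    if rng.random() < epsilon:
        return ROOT
    entropies = [semantic_entropy(ids) for ids in clusters_per_point]
    return max(range(len(entropies)), key=entropies.__getitem__)


if __name__ == "__main__":
    # Three candidate points; the second has the most divergent continuations.
    candidates = [[0, 0, 0], [0, 1, 2], [0, 0, 1]]
    print(choose_branch_point(candidates, epsilon=0.1))
```

Setting `epsilon` to 0 makes the selection purely entropy-driven, while larger values trade local refinement for restarts from the root.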

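
As a rough illustration of what a length-aware segment-level advantage could look like, the sketch below compares sibling segments that branch from the same tree node. The abstract does not give the estimator's formula, so everything here is an assumption: the function name `length_aware_advantages`, the penalty weight `alpha`, and the shaping rule (a correctness value discounted in proportion to relative length, then centered on the sibling mean) are hypothetical choices made only to show the intended behavior, namely that a short correct segment scores above a long correct one, which in turn scores above an incorrect one.

```python
def length_aware_advantages(values, lengths, alpha=0.5):
    """Illustrative length-aware advantages for sibling segments.

    values  : estimated correctness of each sibling segment, in [0, 1]
              (e.g., fraction of verified-correct rollouts passing through it)
    lengths : positive token counts associated with each sibling segment
    alpha   : hypothetical weight of the length penalty, in [0, 1]

    The shaping rule is an assumption, not the estimator from the paper.
    """
    if not values or len(values) != len(lengths):
        raise ValueError("values and lengths must be non-empty and aligned")
    if min(lengths) <= 0:
        raise ValueError("lengths must be positive")

    max_len = max(lengths)
    # Discount each correctness value by its relative length, so only
    # (partially) correct segments pay a length penalty.
    shaped = [v * (1.0 - alpha * (n / max_len)) for v, n in zip(values, lengths)]
    # Center on the sibling mean, as in group-relative baselines.
    baseline = sum(shaped) / len(shaped)
    return [s - baseline for s in shaped]


if __name__ == "__main__":
    # Short correct, long correct, and incorrect sibling segments.
    print(length_aware_advantages([1.0, 1.0, 0.0], [100, 200, 150]))
```

Because the advantages are centered on the sibling mean, they sum to zero within each group of siblings.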