Reinforcement learning with verifiable rewards (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT). We formalize and systematically study an empirical phenomenon that arises during RLVR training, whereby a multilingual model's CoT reverts to its dominant pre-training language (e.g., English) even when prompted in another language; we term this Cross-lingual Collapse. Because the long-CoT regime magnifies exposure to linguistic priors, the underlying trade-off between maximizing reasoning depth and preserving target-language fidelity has remained under-characterized. To examine this trade-off, we train LLMs with Group-Relative Policy Optimization (GRPO) on translated versions of math datasets widely used to elicit long-CoT reasoning, tracking both task accuracy and the language consistency of reasoning chains throughout training. Our experiments yield three findings: (i) under RLVR, CoT in LLMs systematically drifts toward the dominant pre-training language as reasoning performance rises; (ii) English-centric priors, long-CoT GRPO optimization, task difficulty, and high-entropy decoding jointly amplify this drift, and the pattern persists beyond mathematics; and (iii) interventions that favor target-language traces, such as a language-consistency reward, decoding-time controls, or more balanced backbones, mitigate collapse but reveal a persistent trade-off between performance and language fidelity.
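The abstract does not spell out the exact form of the language-consistency reward or the GRPO reward shaping. The Python sketch below is a rough illustration only: it combines a verifiable 0/1 correctness term with a target-language-fidelity bonus, then standardizes rewards within a rollout group in the group-relative style of GRPO. The weight `lambda_lang`, the function names, and the CJK-character heuristic (assuming a Chinese target language) are illustrative assumptions, not the authors' formulation.

```python
# Hypothetical sketch: a language-consistency reward for RLVR/GRPO rollouts.
# The weighting and the script heuristic below are illustrative assumptions,
# not the paper's exact design.
import statistics


def cjk_fraction(text: str) -> float:
    """Fraction of non-space characters in the CJK Unified Ideographs
    block, a crude proxy for Chinese-language fidelity of a CoT trace."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(chars)


def rlvr_reward(cot: str, answer_correct: bool,
                lambda_lang: float = 0.2) -> float:
    """Verifiable 0/1 correctness plus a language-consistency bonus."""
    r_task = 1.0 if answer_correct else 0.0
    return r_task + lambda_lang * cjk_fraction(cot)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled traces for the same Chinese math prompt.
rollouts = [
    ("设 x=3,则 x^2=9。", True),    # correct, in the target language
    ("Let x=3, so x^2=9.", True),   # correct, collapsed to English
    ("设 x=3,则 x^2=6。", False),   # wrong, in the target language
    ("Let x=3, so x^2=6.", False),  # wrong, in English
]
advantages = group_relative_advantages(
    [rlvr_reward(cot, ok) for cot, ok in rollouts]
)
```

Under this shaping, the correct in-language trace receives the highest group-relative advantage, so the policy gradient favors target-language reasoning over an equally correct English trace; the size of `lambda_lang` would govern where the resulting policy lands on the performance-fidelity trade-off the abstract describes.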