The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting from outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity, which we call the task-advantage ratio, that is a joint property of the compositional problem and the base model and that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems in which correct intermediate steps provide a clear advantage are efficiently learnable with RLVR, and we analyze how such an advantage naturally arises in different problems. On the negative side, when this structural advantage is absent, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines whether such an advantage exists and, hence, whether RLVR converges to a suboptimal solution. We hope our analysis provides a principled theoretical understanding of when and why RLVR succeeds and when it does not.
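To make the central quantity concrete, here is one hedged sketch of how a task-advantage ratio could be formalized; the notation $\rho_k$ and $\pi_0$ below is illustrative and is not taken from the paper:
\[
\rho_k(\pi_0) \;=\; \frac{\Pr_{\pi_0}\!\left[\text{outcome reward} = 1 \,\middle|\, \text{step } k \text{ correct}\right]}{\Pr_{\pi_0}\!\left[\text{outcome reward} = 1 \,\middle|\, \text{step } k \text{ incorrect}\right]},
\]
where $\pi_0$ denotes the base model. Under this reading, $\rho_k(\pi_0) > 1$ for every step $k$ would mean that correct intermediate steps raise the expected outcome reward, giving outcome-level feedback a signal that favors correct compositions, while $\rho_k(\pi_0) \le 1$ would correspond to the negative regime in which RLVR may converge to suboptimal compositions.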