Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.