Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's residual stream. Our entropy analysis of internal policies reveals distinct patterns: (1) universally, policies evolve from high-entropy exploration in early layers to deterministic refinement in top layers; and (2) Qwen exhibits a progressive, human-like reasoning structure, in contrast to the abrupt final-layer convergence observed in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations earlier. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's reasoning foundation from the bottom up by optimizing internal layers in the early stages of training. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of BuPO. Our code is available at https://github.com/Trae1ounG/BuPO.
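To make the per-layer entropy analysis concrete, the sketch below shows one way to estimate the entropy of an internal layer policy: project each layer's residual-stream state through the unembedding matrix (a logit-lens-style readout) and compute the entropy of the resulting token distribution. The function name `layer_policy_entropy`, the toy dimensions, and the readout choice are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def layer_policy_entropy(hidden_states, W_U):
    """Entropy of each internal layer policy.

    Each residual-stream state is projected through the unembedding
    matrix W_U to obtain a per-layer token distribution (a logit-lens-
    style readout; an illustrative stand-in for the paper's method).
    """
    entropies = []
    for h in hidden_states:           # h: (d_model,) residual state at one layer
        logits = h @ W_U              # (vocab,)
        logits = logits - logits.max()  # numerical stability before softmax
        p = np.exp(logits) / np.exp(logits).sum()
        entropies.append(float(-(p * np.log(p + 1e-12)).sum()))
    return entropies

# Toy demo: random residual states for 4 layers over a vocabulary of 8 tokens.
rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 8, 4
W_U = rng.normal(size=(d_model, vocab))
# Scaling later layers makes their logits sharper, mimicking the
# high-entropy-early / low-entropy-late pattern described above.
hidden = [rng.normal(size=d_model) * (i + 1) for i in range(n_layers)]
entropies = layer_policy_entropy(hidden, W_U)
```

Each entropy lies between 0 and log(vocab); under the decreasing-entropy pattern reported in the paper, early-layer values would sit near the upper end and top-layer values near the lower end.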