Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration--exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
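The per-step mechanism described in the abstract can be sketched in a few lines: a temperature policy picks a temperature from the hidden state, and the token policy then samples from the temperature-scaled distribution. This is a minimal illustrative sketch, not the paper's implementation; the discrete temperature set, the function names, and the plain-list logits are all assumptions made for the example.

```python
import math
import random

# Assumed discrete temperature set for the temperature policy (illustrative).
TEMPERATURES = [0.3, 0.7, 1.0, 1.3]

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def _sample_index(probs):
    """Sample an index from a categorical distribution given its probabilities."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def select_temperature(temp_logits):
    """Temperature policy: in the paper's framing these logits would come
    from a small head on the model's hidden state; here they are given."""
    return TEMPERATURES[_sample_index(softmax(temp_logits))]

def sample_token(token_logits, tau):
    """Token policy: sample from softmax(logits / tau). Lower tau sharpens
    the distribution (exploitation); higher tau flattens it (exploration)."""
    return _sample_index(softmax([l / tau for l in token_logits]))
```

A decoding step then chains the two policies: `tau = select_temperature(temp_logits)` followed by `tok = sample_token(token_logits, tau)`. Joint training of both policies from downstream rewards (the coordinate ascent scheme) is beyond the scope of this sketch.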