Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail to adapt to the dynamic demands of reinforcement learning (RL) throughout training, often limiting policy improvement. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy. TAMPO operates through a hierarchical two-loop process. In the inner loop, the LLM policy is updated (e.g., using GRPO) with trajectories sampled at the temperature selected by the meta-policy. In the outer loop, the meta-policy updates the distribution over candidate temperatures by rewarding those that maximize the likelihood of high-advantage trajectories. This trajectory-guided, reward-driven mechanism enables online adaptation without additional rollouts, directly aligning exploration with policy improvement. On five mathematical reasoning benchmarks, TAMPO outperforms baselines using fixed or heuristic temperatures, establishing temperature as an effective learnable meta-policy for adaptive exploration in LLM reinforcement learning. Accepted at ICLR 2026.
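The outer loop described above can be illustrated with a toy sketch: a categorical distribution over candidate temperatures, nudged toward temperatures whose sampled trajectories achieved high advantage. The candidate grid, learning rate, and REINFORCE-style update rule here are illustrative assumptions, not the paper's actual specification (which scores temperatures by the likelihood of high-advantage trajectories under the updated policy).

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class TemperatureMetaPolicy:
    """Toy sketch of a learnable temperature meta-policy (outer loop).

    Maintains logits over a fixed grid of candidate temperatures and
    shifts probability mass toward temperatures that yield high mean
    advantage in the inner RL loop. All hyperparameters are assumptions
    for illustration only.
    """

    def __init__(self, candidates=(0.3, 0.7, 1.0, 1.3), lr=0.5):
        self.candidates = np.array(candidates)
        self.logits = np.zeros(len(candidates))  # start uniform
        self.lr = lr

    def sample(self, rng):
        # Draw a temperature for the next batch of rollouts.
        probs = softmax(self.logits)
        idx = rng.choice(len(self.candidates), p=probs)
        return idx, self.candidates[idx]

    def update(self, idx, mean_advantage):
        # REINFORCE-style score-function update: raise the logit of the
        # chosen temperature in proportion to the advantage it produced.
        probs = softmax(self.logits)
        grad = -probs
        grad[idx] += 1.0
        self.logits += self.lr * mean_advantage * grad

rng = np.random.default_rng(0)
meta = TemperatureMetaPolicy()
# Stand-in for the inner loop: pretend higher temperatures happen to
# yield higher mean advantage in this synthetic setting.
for step in range(200):
    idx, temp = meta.sample(rng)
    mean_adv = temp - 0.8 + 0.1 * rng.standard_normal()
    meta.update(idx, mean_adv)
print(softmax(meta.logits))
```

After a few hundred steps the distribution concentrates on the temperatures that the synthetic signal rewards; in TAMPO the same adaptation happens online from the trajectories the inner loop already collects, so no extra rollouts are needed.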