GRPO-style reinforcement learning (RL) algorithms for LLM fine-tuning have recently gained popularity. However, because they rely on heuristic trust-region approximations, they can exhibit brittle optimization behavior: global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which enforces trust-region constraints directly through a principled optimization formulation. This yields a clear, interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Empirically, on diverse mathematical reasoning benchmarks, QUATRO trains stably under increased policy staleness and aggressive learning rates while maintaining well-controlled entropy throughout training.
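For context, the GRPO-style surrogate the abstract refers to is the familiar PPO-style clipped objective with group-normalized advantages; the following minimal sketch uses standard notation (the symbols $G$, $\epsilon$, $r_i$, and $\hat{A}_i$ follow common usage and are not drawn from this paper):

$$
\mathcal{J}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],\qquad
r_i(\theta)=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\qquad
\hat{A}_i=\frac{R_i-\operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}.
$$

When the clipped branch attains the minimum (e.g., $\hat{A}_i>0$ and $r_i(\theta)>1+\epsilon$), the objective is locally constant in $\theta$, so such samples contribute no gradient and their ratios are never pushed back toward the trust region; this is the unregulated out-of-range behavior that the abstract attributes to heuristic trust-region approximations.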