Large language models have demonstrated strong reasoning capabilities on complex tasks through tool integration, a setting typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, iterative optimization, a common class of reasoning tasks, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than by cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard the model's prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task that requires multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.
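To make the distinction concrete, the sketch below contrasts standard trajectory-level GRPO advantage normalization with a turn-level variant of the kind the abstract describes. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the function names, the group size G, and the reward-array shapes are all hypothetical, and the trajectory-value-as-best-turn convention is taken from the abstract's problem statement.

```python
import numpy as np

def grpo_advantages(trajectory_rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO: one scalar reward per sampled trajectory,
    normalized across a group of G trajectories.
    trajectory_rewards: shape (G,)."""
    mu, sigma = trajectory_rewards.mean(), trajectory_rewards.std()
    return (trajectory_rewards - mu) / (sigma + 1e-8)

def turn_level_advantages(turn_rewards: np.ndarray) -> np.ndarray:
    """Hypothetical turn-level variant: at each turn t, sample a group
    of G candidate actions from the same environment state and
    normalize their turn-level rewards within that per-turn group.
    turn_rewards: shape (T, G), G candidates for each of T turns."""
    mu = turn_rewards.mean(axis=1, keepdims=True)
    sigma = turn_rewards.std(axis=1, keepdims=True)
    return (turn_rewards - mu) / (sigma + 1e-8)

# Toy usage: 4 turns, a group of 5 candidate actions per turn.
rng = np.random.default_rng(0)
rewards = rng.normal(size=(4, 5))
adv = turn_level_advantages(rewards)
print(adv.shape)  # (4, 5): one advantage per candidate per turn

# Per the abstract, the trajectory's value in this setting is the
# best turn-level reward achieved, not a cumulative return:
trajectory_value = rewards.max()
```

The key design difference illustrated here is the axis of normalization: trajectory-level GRPO compares whole rollouts against each other, whereas the turn-level scheme compares candidate actions that branch from the same state, which is what enables fine-grained credit assignment within a single optimization run.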