Tool-using agents based on Large Language Models (LLMs) excel at tasks such as mathematical reasoning and multi-hop question answering. Over long trajectories, however, agents often trigger excessive, low-quality tool calls, which increase latency and degrade inference performance, making tool-use behavior difficult to manage. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies that address different needs in optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% relative to the baseline average, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.
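As a minimal sketch of the two reward designs described above, the snippet below computes the Shannon entropy of hypothetical next-token distributions measured around each tool call, then derives a dense per-call reward (entropy drop at each step) and a sparse trajectory-level reward (total entropy reduction). The distributions, function names, and the exact way entropy is measured are illustrative assumptions, not the paper's implementation.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def dense_process_rewards(entropies):
    # Fine-grained supervision: one reward per tool call,
    # equal to the entropy reduction that call produced.
    return [before - after for before, after in zip(entropies, entropies[1:])]

def sparse_outcome_reward(entropies):
    # Coarse trajectory-level guidance: a single reward equal to
    # the total entropy reduction over the whole trajectory.
    return entropies[0] - entropies[-1]

# Hypothetical entropies measured before/after two tool calls:
# the model grows more certain after each call.
traj = [
    entropy([0.25, 0.25, 0.25, 0.25]),   # uniform: maximally uncertain
    entropy([0.5, 0.3, 0.1, 0.1]),       # after call 1
    entropy([0.9, 0.05, 0.03, 0.02]),    # after call 2
]

per_call = dense_process_rewards(traj)   # two positive rewards
overall = sparse_outcome_reward(traj)    # equals sum(per_call)
```

Note that the dense rewards telescope: their sum equals the sparse trajectory-level reward, which is one way to see the two designs as different granularities of the same underlying entropy-reduction signal.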