Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

Multimodal agents, which integrate a controller (e.g., a vision language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, whether supervised fine-tuning or reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. For complex multimodal tasks, however, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose SPORT, an iterative tool usage exploration method for multimodal agents that requires no pre-collected data and refines tool usage trajectories via step-wise preference optimization. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme in which step sampling and step verification are executed alternately to solve the synthesized tasks. In step sampling, the agent tries different tools and obtains the corresponding results. In step verification, we employ a verifier that provides AI feedback, from which we construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves improvements of 6.41% and 3.64%, respectively, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.
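
The alternation of step sampling and step verification described above can be illustrated with a minimal Python sketch. It is not the SPORT implementation: the names `explore`, `PreferencePair`, `toy_sampler`, and `toy_verifier`, the tool-call strings, and the numeric scores are all hypothetical stand-ins, and the rule of pairing the highest-scored candidate against the lowest-scored one is an assumption made for illustration. Task synthesis and the preference tuning update itself are not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class PreferencePair:
    """One step-wise preference example (hypothetical structure)."""
    context: Tuple[str, ...]  # task plus the steps taken so far
    chosen: str               # step the verifier scored highest
    rejected: str             # step the verifier scored lowest


def explore(
    task: str,
    sample_steps: Callable[[Sequence[str]], List[str]],
    verify: Callable[[Sequence[str], str], float],
    max_steps: int = 3,
) -> Tuple[List[str], List[PreferencePair]]:
    """Alternate step sampling and step verification on one task."""
    history: List[str] = [task]
    pairs: List[PreferencePair] = []
    for _ in range(max_steps):
        # Step sampling: the agent proposes several candidate next steps.
        candidates = sample_steps(history)
        if len(candidates) < 2:
            break  # nothing left to compare
        # Step verification: score each candidate once.
        scored = sorted((verify(history, c), c) for c in candidates)
        (low_score, worst), (high_score, best) = scored[0], scored[-1]
        # Keep a pair only when the verifier expresses a real preference.
        if high_score > low_score:
            pairs.append(PreferencePair(tuple(history), best, worst))
        # Continue the trajectory from the preferred step.
        history.append(best)
    return history, pairs


# Toy stand-ins for the controller and the verifier (assumptions).
def toy_sampler(history: Sequence[str]) -> List[str]:
    options = [
        ["call ocr(image)", "call web_search(query)"],
        ["call calculator(expr)", "answer directly"],
    ]
    step = len(history) - 1
    return options[step] if step < len(options) else []


def toy_verifier(history: Sequence[str], candidate: str) -> float:
    scores = {
        "call ocr(image)": 0.9,
        "call web_search(query)": 0.2,
        "call calculator(expr)": 0.8,
        "answer directly": 0.4,
    }
    return scores[candidate]


TASK = "Sum the prices in the receipt image"
trajectory, pairs = explore(TASK, toy_sampler, toy_verifier)
```

In this sketch, each `PreferencePair` carries the context at which the choice was made, so the collected pairs could be passed to a preference tuning objective one step at a time rather than one trajectory at a time.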
