The advancement of large language models (LLMs) has prompted the development of multi-modal agents, which use a model as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To ensure data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, which are then filtered by query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via \underline{T}rajectory \underline{T}uning on VLMs for \underline{T}ool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and {Qwen2-VL-7B}, outperforming the untuned VLMs by $20\%$, which demonstrates that the proposed data synthesis pipeline yields high-quality data for tool-usage capabilities.