Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
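The abstract does not specify the form of the step-wise preference objective used with Pref-X. As a minimal sketch, assuming a DPO-style loss applied per reasoning step (the function name, argument names, and `beta` value are illustrative, not from the paper), the preference term for one chosen/rejected step pair could look like:

```python
import math

def stepwise_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1):
    """DPO-style loss for a single reasoning step (hypothetical sketch).

    logp_chosen / logp_rejected: summed token log-probabilities of the
    preferred / dispreferred step under the policy being trained.
    ref_logp_chosen / ref_logp_rejected: the same quantities under a
    frozen reference model (e.g. the imitation-tuned controller).
    beta: temperature controlling how sharply preferences are enforced.
    """
    # Reward margin: how much more the policy (relative to the reference)
    # favors the chosen step over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when the policy already
    # prefers the chosen step, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Summing this term over the steps of each trajectory pair in Pref-X would yield one plausible instantiation of step-wise preference learning; the actual MATRIX objective may differ.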