VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to using external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remains a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions and fall short of evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and to generalize to unseen operations, with the leading model, Gemini-3.0-Pro, achieving only 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
