We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol (MCP). The benchmark targets realistic multi-hop, multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds the resulting signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary judge ensemble of four large language models (LLMs) reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art Multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structure consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. Our benchmark's anonymous repository is available at https://github.com/EtaYang10th/Open-M3-Bench.
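For concreteness, the alignment step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the tool-call schema, the encoder checkpoint, and the threshold value are all hypothetical, and the paper's similarity bucketing may partition candidate pairs before matching rather than filtering afterward as done here.

```python
# Minimal sketch of similarity-driven tool-call alignment.
# Assumptions (not from the paper): tool calls are dicts with "name" and
# "arguments" keys; any SentenceTransformer checkpoint works; the 0.5
# bucketing threshold is illustrative.
import json
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

def serialize(call):
    """Flatten a tool call into a single signature string."""
    return f'{call["name"]}({json.dumps(call["arguments"], sort_keys=True)})'

def align(pred_calls, gold_calls, model, threshold=0.5):
    """One-to-one alignment of predicted vs. reference tool calls
    via Hungarian matching on signature-embedding similarity."""
    pred_sigs = [serialize(c) for c in pred_calls]
    gold_sigs = [serialize(c) for c in gold_calls]
    # Cosine similarity matrix between L2-normalized embeddings.
    p = model.encode(pred_sigs, normalize_embeddings=True)
    g = model.encode(gold_sigs, normalize_embeddings=True)
    sim = p @ g.T
    # Hungarian matching maximizes total similarity (negate for min-cost).
    rows, cols = linear_sum_assignment(-sim)
    # Keep only pairs that clear the similarity threshold; low-similarity
    # calls stay unmatched, which keeps the correspondence auditable.
    return [(i, j, float(sim[i, j]))
            for i, j in zip(rows, cols) if sim[i, j] >= threshold]

# Usage example with hypothetical tool calls.
model = SentenceTransformer("all-MiniLM-L6-v2")
pred = [{"name": "search_images", "arguments": {"query": "red car"}}]
gold = [{"name": "search_images", "arguments": {"query": "red sports car"}}]
print(align(pred, gold, model))  # e.g. [(0, 0, 0.87...)]
```

Thresholding after the assignment is one plausible reading of "similarity-bucketed"; the key property either way is that each predicted call maps to at most one reference call, so argument- and structure-level metrics can be computed per matched pair.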