m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that stitch together several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and GitHub (https://github.com/RAIVNLab/mnms).
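To make the "JSON vs. code" plan-format question concrete, the sketch below expresses the same two-step plan both ways and checks that they produce the same result. It is only an illustration under stated assumptions: the two tool functions are hypothetical stand-ins (not the benchmark's 33 tools), and the `"<node-k>.key"` reference syntax, the step schema, and the helper names `resolve`, `run_json_plan`, and `run_code_plan` are assumed for this example rather than taken from the m&m's code.

```python
import json

# Hypothetical stand-in tools; names, signatures, and outputs are assumptions.
def image_captioning(image: str) -> dict:
    return {"text": f"a caption for {image}"}

def text_summarization(text: str) -> dict:
    return {"text": text[:20]}

TOOLS = {
    "image_captioning": image_captioning,
    "text_summarization": text_summarization,
}

# JSON-format plan: an ordered list of tool calls. A string such as
# "<node-0>.text" (assumed syntax) refers to the "text" output of step 0.
JSON_PLAN = json.loads("""
[
  {"id": 0, "name": "image_captioning", "args": {"image": "photo.jpg"}},
  {"id": 1, "name": "text_summarization", "args": {"text": "<node-0>.text"}}
]
""")

def resolve(value, outputs):
    """Replace a "<node-k>.key" reference with the earlier step's output."""
    if isinstance(value, str) and value.startswith("<node-"):
        node, key = value.split(">.", 1)
        return outputs[int(node[len("<node-"):])][key]
    return value

def run_json_plan(plan):
    """Execute each step in order, wiring earlier outputs into later args."""
    outputs = {}
    for step in plan:
        args = {k: resolve(v, outputs) for k, v in step["args"].items()}
        outputs[step["id"]] = TOOLS[step["name"]](**args)
    return outputs

def run_code_plan():
    """Code-format plan: the same two steps written as direct Python calls."""
    output0 = image_captioning(image="photo.jpg")
    output1 = text_summarization(text=output0["text"])
    return {0: output0, 1: output1}

if __name__ == "__main__":
    print(run_json_plan(JSON_PLAN) == run_code_plan())
```

In this sketch the JSON plan needs a small interpreter (`run_json_plan`) to resolve references between steps, while the code plan relies on ordinary variables; that difference is one reason the choice of format might matter for a planner.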

