m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models together. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and GitHub (https://github.com/RAIVNLab/mnms).
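To make the design questions above concrete, the following is a minimal sketch of the same two-step plan expressed in a JSON format and in a code format, together with toy versions of the three feedback types (parsing, verification, execution). Everything in it is an assumption for illustration: the tools `caption_image` and `count_words`, the `$0.text` reference syntax, and all function names are hypothetical and are not the benchmark's actual toolset or plan schema.

```python
import json

# Hypothetical toy tools standing in for real models/APIs (not the benchmark's toolset).
def caption_image(image: str) -> dict:
    return {"text": f"a photo stored at {image}"}

def count_words(text: str) -> dict:
    return {"count": len(text.split())}

TOOLS = {"caption_image": caption_image, "count_words": count_words}

# JSON plan format: a list of steps. "$0.text" is an assumed syntax meaning
# "the 'text' field of step 0's output"; step ids are assumed to equal positions.
JSON_PLAN = json.dumps([
    {"id": 0, "name": "caption_image", "args": {"image": "cat.jpg"}},
    {"id": 1, "name": "count_words", "args": {"text": "$0.text"}},
])

# Code plan format: the same plan written as Python calls.
CODE_PLAN = (
    'out0 = caption_image(image="cat.jpg")\n'
    'out1 = count_words(text=out0["text"])\n'
)

def parse_plan(raw: str):
    """Parsing feedback: is the raw string a JSON list of steps?"""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as err:
        return None, f"parsing error: {err}"
    if not isinstance(plan, list):
        return None, "parsing error: plan must be a list of steps"
    return plan, None

def verify_plan(plan: list) -> list:
    """Verification feedback: known tool names, references only to earlier steps."""
    errors = []
    for pos, step in enumerate(plan):
        if step.get("name") not in TOOLS:
            errors.append(f"step {pos}: unknown tool {step.get('name')!r}")
        for value in step.get("args", {}).values():
            if isinstance(value, str) and value.startswith("$"):
                ref = int(value[1:].split(".")[0])
                if ref >= pos:
                    errors.append(f"step {pos}: reference to later step {ref}")
    return errors

def execute_plan(plan: list) -> dict:
    """Execution: run each step, resolving "$id.field" references to earlier outputs."""
    outputs = {}
    for step in plan:
        args = {}
        for key, value in step["args"].items():
            if isinstance(value, str) and value.startswith("$"):
                ref, field = value[1:].split(".")
                value = outputs[int(ref)][field]
            args[key] = value
        outputs[step["id"]] = TOOLS[step["name"]](**args)
    return outputs

def execute_code_plan(src: str) -> dict:
    """Run a code-format plan with only the toy tools in scope (toy setting, not sandboxed)."""
    scope = dict(TOOLS)
    exec(src, scope)
    return scope

if __name__ == "__main__":
    plan, error = parse_plan(JSON_PLAN)
    print(error, verify_plan(plan), execute_plan(plan))
    print(execute_code_plan(CODE_PLAN)["out1"])
```

In this sketch the JSON form is easy to check before running anything, while the code form carries intermediate results in ordinary variables. Which format and which feedback actually help planners is the empirical question the benchmark is designed to answer.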

