Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that stitch together several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and GitHub (https://github.com/RAIVNLab/mnms).
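To make the JSON plan format and the first two feedback types more concrete, the following is a minimal sketch, not the benchmark's actual implementation. The tool names (`caption_image`, `translate_text`), the step schema (`id`, `name`, `args`), and the `<node-0>.text` reference syntax are all hypothetical and chosen only for illustration. Parsing feedback reports whether the plan is well-formed JSON, and verification feedback checks tool and argument names without running any tool. Execution feedback would require actually calling the tools and is omitted here.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
# These names are illustrative and are not taken from the benchmark's toolset.
TOOLS = {
    "caption_image": {"image"},
    "translate_text": {"text", "target_language"},
}

# A two-step plan in an assumed JSON format; step 1 consumes the output of step 0.
RAW_PLAN = """
[
  {"id": 0, "name": "caption_image", "args": {"image": "photo.jpg"}},
  {"id": 1, "name": "translate_text",
   "args": {"text": "<node-0>.text", "target_language": "fr"}}
]
"""


def parse_plan(raw):
    """Parsing feedback: return (plan, None) on success or (None, message) on failure."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"parsing error: {exc.msg}"
    if not isinstance(plan, list):
        return None, "parsing error: plan must be a JSON list of steps"
    return plan, None


def verify_plan(plan):
    """Verification feedback: check tool and argument names without executing anything."""
    errors = []
    for i, step in enumerate(plan):
        if not isinstance(step, dict):
            errors.append(f"step {i}: step must be a JSON object")
            continue
        name = step.get("name")
        args = step.get("args", {})
        if name not in TOOLS:
            errors.append(f"step {i}: unknown tool {name!r}")
            continue
        if not isinstance(args, dict):
            errors.append(f"step {i}: args must be a JSON object")
            continue
        missing = TOOLS[name] - set(args)
        if missing:
            errors.append(f"step {i}: missing arguments {sorted(missing)}")
    return errors


if __name__ == "__main__":
    plan, error = parse_plan(RAW_PLAN)
    if error is not None:
        print(error)
    else:
        print(verify_plan(plan) or "plan passed verification")
```

In a feedback loop of this kind, any returned error message would be appended to the planner's prompt so the LLM can revise its plan. The code plan format would instead express the same two steps as Python function calls.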