Real-world multi-modal problems are rarely solved by a single machine learning model; they often require multi-step computational plans that stitch together several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and GitHub (https://github.com/RAIVNLab/mnms).
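To make the JSON plan format and the first two feedback types concrete, the following is a minimal sketch rather than the benchmark's actual implementation. The tool names, argument names, the `<node-0>.text` reference syntax, and the helper functions `parse_plan` and `verify_plan` are all hypothetical and chosen only for illustration; the real schema and toolset are defined in the released code.

```python
import json

# Hypothetical toolset: names and expected arguments are illustrative only,
# not the 33 tools shipped with the benchmark.
TOOLS = {
    "image_captioning": {"image"},
    "text_summarization": {"text"},
}

def parse_plan(raw):
    """Parsing feedback: return (plan, None) on success or (None, message)."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"parsing error: {exc.msg}"
    if not isinstance(plan, list):
        return None, "parsing error: plan must be a list of steps"
    return plan, None

def verify_plan(plan):
    """Verification feedback: check tool names and argument names per step."""
    errors = []
    for i, step in enumerate(plan):
        if not isinstance(step, dict):
            errors.append(f"step {i}: step must be an object")
            continue
        name = step.get("name")
        if name not in TOOLS:
            errors.append(f"step {i}: unknown tool {name!r}")
            continue
        args = step.get("args", {})
        given = set(args) if isinstance(args, dict) else set()
        if given != TOOLS[name]:
            errors.append(
                f"step {i}: expected args {sorted(TOOLS[name])}, got {sorted(given)}"
            )
    return errors

# A two-step plan in an assumed JSON format; the second step refers to the
# output of the first through a made-up "<node-0>.text" placeholder.
raw_plan = json.dumps([
    {"name": "image_captioning", "args": {"image": "input.jpg"}},
    {"name": "text_summarization", "args": {"text": "<node-0>.text"}},
])

plan, parse_error = parse_plan(raw_plan)
verification_errors = verify_plan(plan) if plan is not None else []
```

In a feedback loop, any message returned by these checks would be appended to the planner's prompt before it retries. Execution feedback would additionally run each tool and report runtime errors, which this sketch leaves out.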