The reasoning capabilities of large language models (LLMs) are widely acknowledged in recent research, inspiring studies on tool learning and autonomous agents. An LLM serves as the "brain" of an agent, orchestrating multiple tools for collaborative multi-step task solving. Unlike methods that invoke tools such as calculators or weather APIs for straightforward tasks, multi-modal agents excel by integrating diverse AI models to tackle complex challenges. However, current multi-modal agents neglect the significance of model selection: they focus primarily on the planning and execution phases and invoke only predefined task-specific models for each subtask, which makes execution fragile. Meanwhile, traditional model selection methods are either incompatible with or suboptimal for multi-modal agent scenarios, because they ignore the dependencies among subtasks that arise from multi-step reasoning. To this end, we identify the key challenges therein and propose the $\textit{M}^3$ framework as a plug-in with negligible runtime overhead at test time. This framework improves model selection and bolsters the robustness of multi-modal agents in multi-step reasoning. In the absence of suitable benchmarks, we create MS-GQA, a new dataset specifically designed to investigate the model selection challenge in multi-modal agents. Our experiments show that the framework enables dynamic model selection that considers both user inputs and subtask dependencies, thereby making the overall reasoning process more robust. Our code and benchmark: https://github.com/LINs-lab/M3.
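To make the contrast between predefined and dynamic model selection concrete, the following is a minimal toy sketch. It is not the $\textit{M}^3$ method, whose details are not given in this abstract. All model names, the candidate pool, and the scoring rule `toy_score` are invented for illustration, and the greedy step-by-step choice is a simplifying assumption.

```python
from typing import Callable, Dict, List

# Hypothetical candidate pool: subtask type -> model names (all invented).
CANDIDATES: Dict[str, List[str]] = {
    "detect": ["detector_a", "detector_b"],
    "answer": ["vqa_a", "vqa_b"],
}

# Static baseline: one predefined model per subtask type,
# regardless of the user input or of earlier choices.
STATIC_CHOICE: Dict[str, str] = {"detect": "detector_a", "answer": "vqa_a"}


def toy_score(model: str, query: str, upstream: List[str]) -> float:
    """Invented scoring rule standing in for a learned selector."""
    score = 0.0
    # Input dependence (assumed): detector_b suits counting questions.
    if model == "detector_b" and "how many" in query.lower():
        score += 1.0
    # Subtask dependence (assumed): vqa_b works better on detector_b outputs.
    if model == "vqa_b" and "detector_b" in upstream:
        score += 1.0
    return score


def select_static(plan: List[str]) -> List[str]:
    """Pick the predefined model for every subtask in the plan."""
    return [STATIC_CHOICE[task] for task in plan]


def select_dynamic(
    plan: List[str],
    query: str,
    score: Callable[[str, str, List[str]], float] = toy_score,
) -> List[str]:
    """Greedily pick, per subtask, the candidate with the highest score,
    conditioning on the query and on the models already chosen upstream."""
    chosen: List[str] = []
    for task in plan:
        best = max(CANDIDATES[task], key=lambda m: score(m, query, chosen))
        chosen.append(best)
    return chosen


plan = ["detect", "answer"]
static_result = select_static(plan)
counting_result = select_dynamic(plan, "How many dogs are there?")
color_result = select_dynamic(plan, "What color is the car?")
```

In this toy setup the static baseline always returns the same pair of models, while the dynamic selector changes its choice for the counting question and propagates that choice to the downstream subtask. A greedy pass only looks at upstream choices; a selector that also anticipates downstream effects would need to score whole model sequences.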