Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks due to emergent reasoning abilities. However, LLMs have inherent limitations: they cannot access up-to-date information (stored on the Web or in task-specific knowledge bases), use external tools, or perform precise mathematical and logical reasoning. In this paper, we present Chameleon, an AI system that mitigates these limitations by augmenting LLMs with plug-and-play modules for compositional reasoning. Chameleon synthesizes programs by composing various tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) to accomplish complex reasoning tasks. At the heart of Chameleon is an LLM-based planner that assembles a sequence of tools, which are then executed in order to generate the final response. We showcase the effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86.54% overall accuracy on ScienceQA, improving the best published few-shot result by 11.37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17.0% over the previous state of the art, lifting it to 98.78%. Our analysis also shows that the GPT-4-powered planner exhibits more consistent and rational tool selection than a ChatGPT-powered planner, because it infers potential constraints from the instructions.
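The plan-then-execute idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the module names (`table_verbalizer`, `knowledge_retrieval`, `answer_generator`), the shared context dictionary, and the rule-based `plan` function are all hypothetical stand-ins. In particular, the planner here is a hard-coded rule, whereas in Chameleon it is an LLM prompted to select tools.

```python
from typing import Callable, Dict, List, Tuple

# A shared context that each module reads from and writes to (assumption).
Context = Dict[str, str]
Module = Callable[[Context], Context]


def table_verbalizer(ctx: Context) -> Context:
    # Hypothetical tool: turn a raw table string into a plain sentence.
    return {**ctx, "table_text": "Table contents: " + ctx["table"]}


def knowledge_retrieval(ctx: Context) -> Context:
    # Hypothetical tool: a real system might query an LLM or a search engine.
    return {**ctx, "knowledge": "(background for: " + ctx["question"] + ")"}


def answer_generator(ctx: Context) -> Context:
    # Hypothetical tool: produce the final response from accumulated context.
    used = sorted(k for k in ctx if k not in ("question", "table"))
    return {**ctx, "answer": "answer derived from " + ", ".join(used)}


# Plug-and-play inventory: adding a tool means adding one entry here.
INVENTORY: Dict[str, Module] = {
    "table_verbalizer": table_verbalizer,
    "knowledge_retrieval": knowledge_retrieval,
    "answer_generator": answer_generator,
}


def plan(ctx: Context, inventory: Dict[str, Module]) -> List[str]:
    # Stand-in for the LLM-based planner: a fixed rule instead of a prompt.
    steps: List[str] = []
    if "table" in ctx:
        steps.append("table_verbalizer")
    steps += ["knowledge_retrieval", "answer_generator"]
    # Keep only tools that actually exist in the inventory.
    return [name for name in steps if name in inventory]


def run(ctx: Context) -> Tuple[List[str], Context]:
    # Execute the planned tools sequentially, threading the context through.
    steps = plan(ctx, INVENTORY)
    for name in steps:
        ctx = INVENTORY[name](ctx)
    return steps, ctx


if __name__ == "__main__":
    steps, result = run({"question": "Which item costs more?", "table": "pen 2 | book 9"})
    print(steps)
    print(result["answer"])
```

The sequential design means each tool sees everything earlier tools produced, so the planner only has to output an ordered list of module names.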