Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning

Recently, the astonishing performance of large language models (LLMs) on natural language comprehension and generation tasks has triggered extensive exploration of using them as central controllers to build agent systems. Multiple studies focus on connecting LLMs to external tools to extend their application scenarios. However, the tool-use ability of current LLMs is limited to a single text query, which may result in ambiguity in understanding users' real intentions. LLMs are expected to eliminate this ambiguity by perceiving the information in visually or auditorily grounded instructions. Therefore, in this paper, we propose Tool-LMM, a system incorporating open-source LLMs and multi-modal encoders so that the learned LLMs are aware of multi-modal input instructions and can then correctly select the function-matched tool. To facilitate the evaluation of the model's capability, we collect a dataset of multi-modal input tools from HuggingFace. Another important feature of our dataset is that it contains multiple potential choices for the same instruction, owing to the existence of identical and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our LMM is capable of recommending appropriate tools for multi-modal instructions. Code and data are available at https://github.com/Tool-LMM/Tool-LMM.
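Because the dataset allows several tools to be correct for the same instruction, a recommendation should presumably be scored against a set of acceptable tools rather than a single label. The following is a minimal sketch of that idea only; the `Example` schema, the field names, and the tool names are hypothetical and are not taken from the Tool-LMM code or data, and the metric shown is plain accuracy, not necessarily the one reported in the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Example:
    """One instruction plus every tool accepted as correct (hypothetical schema)."""
    instruction: str
    modality: str  # e.g. "text", "image", "audio"
    acceptable_tools: set[str] = field(default_factory=set)


def is_correct(predicted_tool: str, example: Example) -> bool:
    """A prediction counts as a hit if it matches any acceptable tool."""
    return predicted_tool in example.acceptable_tools


def accuracy(predictions: list[str], examples: list[Example]) -> float:
    """Fraction of instructions whose predicted tool is among the acceptable ones."""
    if len(predictions) != len(examples):
        raise ValueError("predictions and examples must have the same length")
    if not examples:
        return 0.0
    hits = sum(is_correct(p, e) for p, e in zip(predictions, examples))
    return hits / len(examples)


# Made-up tool names, for illustration only.
examples = [
    Example("Describe this picture", "image", {"captioner_a", "captioner_b"}),
    Example("Transcribe this recording", "audio", {"asr_a"}),
]
score = accuracy(["captioner_b", "asr_a"], examples)
```

Scoring against a set means that two tools with identical or synonymous functions are both accepted, so a model is not penalized for choosing one valid tool over another.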

