Driven by curiosity, humans have continually sought to explore and understand the world around them, and have invented various tools to satisfy this inquisitiveness. Although humans cannot process and memorize vast amounts of information in their brains, they excel at critical thinking, planning, reflection, and harnessing available tools to interact with and interpret the world, which enables them to find answers efficiently. Recent advances in large language models (LLMs) suggest that machines might also possess these human-like capabilities, allowing them to exhibit powerful abilities even with a constrained parameter count. In this paper, we introduce KwaiAgents, a generalized information-seeking agent system that employs LLMs as its cognitive core. The agent is capable of understanding a user's query and behavior guidelines, and of referencing external documents. It can also update and retrieve information from its internal memory, plan and execute actions using a time-aware search-browse toolkit, and ultimately provide a comprehensive response. We further investigate the system's performance when powered by LLMs less advanced than GPT-4, and introduce the Meta-Agent Tuning (MAT) framework, designed to ensure that even an open-source 7B or 13B model performs well across many agent systems. We use both benchmark and human evaluations to systematically validate these capabilities. Extensive experiments show that our agent system outperforms other autonomous agents and highlight the enhanced generalized agent abilities of our fine-tuned LLMs.
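The control flow described in the abstract (retrieve from memory, plan an action, execute it with a time-aware tool, update memory, and finally respond) can be illustrated with a minimal sketch. This is not the KwaiAgents implementation: every name here (`Memory`, `search`, `browse`, `scripted_planner`, `run_agent`) is a hypothetical stand-in, the tools are stubs that return placeholder strings, and the planner is a fixed script where a real system would call an LLM.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Memory:
    """Hypothetical internal memory: stores observations, ranks by word overlap."""
    entries: List[str] = field(default_factory=list)

    def update(self, item: str) -> None:
        self.entries.append(item)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        words = set(query.lower().split())
        # Stable sort: entries with equal overlap keep insertion order.
        ranked = sorted(
            self.entries,
            key=lambda e: len(words & set(e.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def search(query: str, today: date) -> str:
    # Stub tool (assumption): a real tool would query a search engine.
    # Passing the current date is what makes the call "time-aware" here.
    return f"[search on {today.isoformat()}] results for '{query}'"


def browse(url: str, today: date) -> str:
    # Stub tool (assumption): a real tool would fetch and summarize the page.
    return f"[browse on {today.isoformat()}] content of {url}"


TOOLS = {"search": search, "browse": browse}


def scripted_planner(query: str, context: List[str], step: int) -> dict:
    """Stand-in for the LLM planner: returns the next action as a dict."""
    if step == 0:
        return {"tool": "search", "args": {"query": query}}
    if step == 1:
        return {"tool": "browse", "args": {"url": "https://example.com"}}
    return {"tool": "finish", "args": {"answer": " | ".join(context)}}


def run_agent(query, planner, tools, today, max_steps=5) -> str:
    memory = Memory()
    for step in range(max_steps):
        context = memory.retrieve(query)            # retrieve from memory
        action = planner(query, context, step)      # plan the next action
        if action["tool"] == "finish":
            return action["args"]["answer"]         # final response
        observation = tools[action["tool"]](today=today, **action["args"])
        memory.update(observation)                  # update memory
    return "no answer within step budget"


if __name__ == "__main__":
    print(run_agent("KwaiAgents release", scripted_planner, TOOLS, date(2024, 1, 1)))
```

Replacing `scripted_planner` with a function that prompts an LLM and parses its output into the same action dictionary would leave the loop itself unchanged, which is one reason to keep the planner behind a plain callable in a sketch like this.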