Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Building on this, self-evolving agents have emerged, but current work largely limits adaptation to prompt rewriting or failure retries. We therefore present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted into parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection using each tool's description and use cases, then executes an agent equipped with the MCP Executor. Across three benchmarks (GAIA, PathVQA, and Humanity's Last Exam), ALITA-G attains strong gains while reducing computational cost. On the GAIA validation set, it achieves 83.03% pass@1 and 89.09% pass@3, establishing a new state-of-the-art result while reducing mean tokens per example by approximately 15% relative to a strong baseline agent. ALITA-G thus provides a principled pathway from generalist capability to reusable, domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.
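The retrieval-augmented MCP selection described above can be sketched minimally as follows. This is a hypothetical illustration, not the paper's implementation: it assumes an MCP Box stored as a mapping from tool name to a description and a list of use cases, and ranks tools against a task query with a toy bag-of-words cosine similarity (a real system would use learned embeddings).

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding: lowercase token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_mcps(task, mcp_box, k=2):
    """Rank MCPs by similarity of the task to each tool's
    description plus use cases; return the top-k tool names."""
    query = embed(task)
    scored = []
    for name, meta in mcp_box.items():
        doc = embed(meta["description"] + " " + " ".join(meta["use_cases"]))
        scored.append((cosine(query, doc), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical MCP Box entries, each with a description and use cases.
mcp_box = {
    "web_search": {
        "description": "search the web for pages matching a query",
        "use_cases": ["find recent news", "look up facts online"],
    },
    "table_parser": {
        "description": "parse tables from documents into rows",
        "use_cases": ["extract a table from a pdf report"],
    },
}

print(select_mcps("look up facts about recent news on the web", mcp_box, k=1))
# → ['web_search']
```

The selected subset, rather than the full MCP Box, is then handed to the executor agent, which keeps the tool context small at inference time.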