Multi-agent AI systems have proven effective for complex reasoning. These systems are composed of specialized agents that collaborate through explicit communication, but they incur substantial computational overhead. A natural question arises: can we achieve similar modularity benefits with a single agent that selects from a library of skills? We explore this question by viewing skills as internalized agent behaviors. From this perspective, a multi-agent system can be compiled into an equivalent single-agent system, trading inter-agent communication for skill selection. Our preliminary experiments suggest this approach can substantially reduce token usage and latency while maintaining competitive accuracy on reasoning benchmarks. However, this efficiency raises a deeper question that has received little attention: how does skill selection scale as libraries grow? Drawing on principles from cognitive science, we propose that LLM skill selection exhibits bounded capacity analogous to human decision-making. We investigate the scaling behavior of skill selection and observe a striking pattern: rather than degrading gradually, selection accuracy remains stable up to a critical library size and then drops sharply, indicating a phase transition reminiscent of capacity limits in human cognition. Furthermore, we find evidence that semantic confusability among similar skills, rather than library size alone, plays a central role in this degradation. This perspective suggests that hierarchical organization, which has long helped humans manage complex choices, may similarly benefit AI systems. Our initial results with hierarchical routing support this hypothesis. This work opens new questions about the fundamental limits of semantics-based skill selection in LLMs and offers a cognitively grounded framework and practical guidelines for designing scalable skill-based agents.