Learning to Recommend Multi-Agent Subgraphs from Calling Trees

Multi-agent systems (MAS) increasingly solve complex tasks by orchestrating agents and tools selected from rapidly growing marketplaces. As these marketplaces expand, many candidates overlap in function, so selection is more than a retrieval problem: beyond filtering for relevant agents, an orchestrator must choose options that are reliable, compatible with the current execution context, and able to cooperate with the other selected agents. Existing recommender systems -- largely built for item-level ranking from flat user-item logs -- do not directly address the structured, sequential, and interaction-dependent nature of agent orchestration. We address this gap by \textbf{formulating agent recommendation in MAS as a constrained decision problem} and introducing a generic \textbf{constrained recommendation framework}. The framework first uses retrieval to build a compact candidate set conditioned on the current subtask and context, and then performs \textbf{utility optimization} within this feasible set using a learned scorer that accounts for relevance, reliability, and interaction effects. We ground both the formulation and the learning signals in \textbf{historical calling trees}, which capture the execution structure of MAS (parent-child calls, branching dependencies, and local cooperation patterns) beyond what flat logs provide. The framework supports two complementary settings: \textbf{agent-level recommendation} (select the next agent/tool) and \textbf{system-level recommendation} (select a small, connected agent team/subgraph for coordinated execution). To enable systematic evaluation, we construct a unified calling-tree benchmark by normalizing invocation logs from eight heterogeneous multi-agent corpora into a shared structured representation.
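The two-stage idea described above (retrieve a compact candidate set, then maximize utility inside it) can be illustrated with a minimal sketch for the agent-level setting. Everything here is an assumption made for illustration and is not taken from the paper: the `CallNode` structure, the `collect_stats` and `recommend` helpers, the agent names, the relevance scores, and the hand-set linear weights, which stand in for the learned scorer.

```python
from dataclasses import dataclass, field


@dataclass
class CallNode:
    """One invocation in a (hypothetical) historical calling tree."""
    agent: str
    success: bool
    children: list["CallNode"] = field(default_factory=list)


def collect_stats(trees):
    """Count per-agent calls and successes, plus parent->child call pairs."""
    calls, wins, pairs = {}, {}, {}
    stack = list(trees)
    while stack:
        node = stack.pop()
        calls[node.agent] = calls.get(node.agent, 0) + 1
        wins[node.agent] = wins.get(node.agent, 0) + int(node.success)
        for child in node.children:
            key = (node.agent, child.agent)
            pairs[key] = pairs.get(key, 0) + 1
            stack.append(child)
    return calls, wins, pairs


def recommend(parent, relevance, trees, k=2, weights=(1.0, 1.0, 0.5)):
    """Pick the next agent for `parent` from the top-k relevant candidates."""
    calls, wins, pairs = collect_stats(trees)
    # Stage 1 (retrieval): keep only the k most relevant candidates.
    candidates = sorted(relevance, key=relevance.get, reverse=True)[:k]
    w_rel, w_reliab, w_inter = weights  # assumed weights, not learned here

    def utility(agent):
        # Reliability: Laplace-smoothed historical success rate.
        reliability = (wins.get(agent, 0) + 1) / (calls.get(agent, 0) + 2)
        # Interaction: how often `parent` called this agent in past trees.
        interaction = pairs.get((parent, agent), 0) / max(calls.get(parent, 0), 1)
        return w_rel * relevance[agent] + w_reliab * reliability + w_inter * interaction

    # Stage 2 (utility optimization): best candidate within the feasible set.
    return max(candidates, key=utility)


# Toy history: two calling trees rooted at a hypothetical "planner" agent.
trees = [
    CallNode("planner", True, [CallNode("search_a", True), CallNode("search_b", False)]),
    CallNode("planner", True, [CallNode("search_a", True)]),
]
# Hypothetical retrieval scores for the current subtask.
relevance = {"search_a": 0.8, "search_b": 0.9, "calculator": 0.1}

best = recommend("planner", relevance, trees)
```

In this toy history, `search_b` has the higher relevance score, but `search_a` has the better success record and is called more often by `planner`, so the utility stage prefers it. Setting `k=1` shrinks the feasible set to `search_b` alone, which shows how the retrieval stage constrains the final choice. A system-level variant would score connected subgraphs rather than single agents.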
