The transition to open, distributed Multi-Agent Systems (MAS) promises scalable intelligence but introduces a non-trivial tension: maximizing global efficiency requires cooperative, resource-aware scheduling, yet autonomous agents may be self-interested and cannot be managed by a centralized controller. Prior approaches fall short in two key areas: they typically focus on single-query routing, neglecting long-term resource reuse (e.g., KV caching) and the complexities of system-level many-to-many matching; moreover, they rely on generic incentive mechanisms that ignore the distinct characteristics of LLM inference. To bridge this gap, we propose IEMAS (Incentive-Efficiency Mechanism for Multi-Agent Systems), a distributed framework that aligns economic incentives with system performance. IEMAS integrates a probabilistic predictive model to estimate Quality of Service (QoS) under uncertainty, which feeds into a VCG-based bipartite matching mechanism. This design guarantees truthful capability reporting and social optimality while explicitly exploiting KV-cache affinity to minimize computational redundancy. We implement IEMAS on top of vLLM and evaluate it via extensive simulations. Results demonstrate that our incentive-efficiency co-design reduces average service cost by 35% and end-to-end latency by up to 2.9× compared to baselines.
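To make the VCG component concrete, the following is a minimal sketch of how truthfulness-inducing Clarke-pivot payments are computed over a one-to-one agent-to-query matching. This is an illustrative simplification, not the paper's actual mechanism: IEMAS handles many-to-many matching with QoS predictions and cache affinity, whereas this brute-force version (with hypothetical helper names `best_assignment` and `vcg_payments`) only shows the generic VCG structure, where each agent pays the externality its presence imposes on the others.

```python
from itertools import permutations


def best_assignment(value):
    """Brute-force welfare-maximizing one-to-one matching.

    value[i][j] = reported value of agent i serving query j.
    Assumes len(value) <= number of queries. Exponential; demo only.
    """
    if not value:
        return 0, ()
    n_agents, n_queries = len(value), len(value[0])
    best_w, best_assign = float("-inf"), None
    for perm in permutations(range(n_queries), n_agents):
        w = sum(value[i][perm[i]] for i in range(n_agents))
        if w > best_w:
            best_w, best_assign = w, perm
    return best_w, best_assign


def vcg_payments(value):
    """Clarke pivot rule: agent i pays the welfare loss it causes others."""
    welfare, assign = best_assignment(value)
    payments = []
    for i in range(len(value)):
        # Optimal welfare of the other agents if agent i were absent...
        others = [row for k, row in enumerate(value) if k != i]
        w_without_i, _ = best_assignment(others)
        # ...minus the others' welfare under the chosen assignment.
        w_others = welfare - value[i][assign[i]]
        payments.append(w_without_i - w_others)
    return assign, payments
```

For example, with `value = [[4, 1], [3, 2]]` both agents prefer query 0; the optimal matching gives it to agent 0, who then pays 1 (agent 1's forgone gain of moving from value 2 to value 3), while agent 1 pays nothing. Under this rule, misreporting one's values cannot improve an agent's utility, which is the truthfulness property the abstract invokes.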