Large language models are increasingly deployed as complex agentic systems that scale with task complexity. While prior work has extensively explored model- and system-level scaling, algorithm- and task-level scaling remain largely unaddressed, which limits the potential of agentic systems. At the algorithm level, allocating additional inference-time computation can enhance workflow capacity but introduces cross-path redundancy, i.e., overlapping computation across multiple reasoning branches. At the task level, complex tasks can be decomposed into subproblems and delegated to multiple agents for improved scalability and parallelism. However, schedulers in existing infrastructures are unaware that multiple agents exist and therefore miss opportunities to optimize resource allocation. We propose Hive, a multi-agent infrastructure that enables algorithm- and task-level scaling. Hive features a description frontend that captures per-agent behavior and supports test-time scaling algorithms. Leveraging this specification, the backend introduces two key mechanisms: Logits Cache, which reuses intermediate logits across redundant sampling paths to mitigate cross-path redundancy at the algorithm level, and Agent-Aware Scheduling, which efficiently allocates compute and KV-cache resources according to agent contributions at the task level. Experiments show that Logits Cache achieves an average speedup of $1.11\times$ to $1.76\times$ for re-sampling, and Agent-Aware Scheduling reduces the hotspot miss rate by $33\%$ to $51\%$.
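To make the idea of cross-path redundancy concrete, the following is a minimal sketch of prefix-keyed logit reuse during repeated sampling. It is not Hive's implementation: the class `PrefixLogitsCache`, the helpers `sample_token` and `sample_path`, and the stand-in model `toy_forward` are all hypothetical names introduced for illustration, and the sketch assumes that two sampling paths sharing the same token prefix can share the same next-token logits.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Logits = List[float]


class PrefixLogitsCache:
    """Hypothetical cache: token prefix -> next-token logits."""

    def __init__(self, forward: Callable[[Tuple[int, ...]], Logits]) -> None:
        self._forward = forward  # the expensive model call (assumed)
        self._store: Dict[Tuple[int, ...], Logits] = {}
        self.hits = 0
        self.misses = 0

    def logits(self, prefix: Sequence[int]) -> Logits:
        key = tuple(prefix)
        if key in self._store:
            self.hits += 1  # another path already computed this prefix
        else:
            self.misses += 1
            self._store[key] = self._forward(key)
        return self._store[key]


def sample_token(logits: Logits, rng: random.Random, temperature: float = 1.0) -> int:
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def sample_path(cache: PrefixLogitsCache, prompt: Sequence[int],
                steps: int, rng: random.Random) -> List[int]:
    """Extend the prompt by `steps` sampled tokens, reusing cached logits."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(sample_token(cache.logits(tokens), rng))
    return tokens


def toy_forward(prefix: Tuple[int, ...]) -> Logits:
    """Deterministic stand-in for a model forward pass over a 4-token vocabulary."""
    vocab_size = 4
    total = sum(prefix)
    return [float((total + i) % vocab_size) for i in range(vocab_size)]


if __name__ == "__main__":
    cache = PrefixLogitsCache(toy_forward)
    rng = random.Random(0)
    paths = [sample_path(cache, [1, 2], steps=3, rng=rng) for _ in range(8)]
    print(f"paths={len(paths)} hits={cache.hits} misses={cache.misses}")
```

In this sketch every path starts from the same prompt, so the first sampling step is computed once and reused by all later paths; deeper steps are reused only when paths happen to sample the same tokens. The sampling randomness stays per-path, since only the logits are shared, not the drawn token.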