Compass: SLO-aware Query Planner for Compound AI Serving at Scale

The rise of compound AI serving, which integrates multiple operators in a pipeline, enables end-user applications such as generative AI-powered meeting companions, autonomous driving, and immersive gaming. These workloads span diverse deployment spaces, from cloud-only queries to edge-assisted ones across infrastructure tiers, and a single application often includes both. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and cost -- requires joint planning of operators' placement, configuration, and resource allocation. However, diverse SLOs, varying runtime environments (e.g., heterogeneous device speeds), and a large volume of queries competing for shared infrastructure explode the planning space, making real-time serving and cost-efficient deployment intractable with existing advances. This paper presents Compass, the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. Compass decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploiting plan similarities within and across queries to cut the number of search steps. It further improves per-step efficiency with a plan profiler that performs selective profiling to achieve high-fidelity performance estimates at a fraction of the profiling cost. At runtime, Compass performs query-plan bipartite matching to maximize SLO goodput under resource contention. Real-world evaluations show that Compass improves service goodput by 2.4--5.1x, reduces deployment costs by 3.8--4.5x, and accelerates planning by 4.2--10.5x, achieving service responsiveness within seconds and near-optimal decision quality.
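The runtime step mentioned in the abstract, matching queries to plans so that as many queries as possible meet their SLOs, can be illustrated with a small sketch. This is not Compass's algorithm: the abstract does not specify the matching objective or how contention is modeled. The sketch assumes each candidate plan is a slot that can serve at most one query, that an edge exists only when a plan's estimated latency, accuracy, and cost satisfy the query's SLO, and it uses a standard augmenting-path maximum matching (Kuhn's algorithm). All names (`Plan`, `Query`, `meets_slo`, `match_queries_to_plans`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate plan with estimated performance."""
    name: str
    latency_ms: float
    accuracy: float
    cost: float


@dataclass(frozen=True)
class Query:
    """Hypothetical query with its SLO bounds."""
    name: str
    max_latency_ms: float
    min_accuracy: float
    max_cost: float


def meets_slo(query: Query, plan: Plan) -> bool:
    """An edge exists only if the plan satisfies every SLO bound."""
    return (plan.latency_ms <= query.max_latency_ms
            and plan.accuracy >= query.min_accuracy
            and plan.cost <= query.max_cost)


def match_queries_to_plans(queries, plans):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm).

    Assumption: each plan serves at most one query.
    Returns a dict mapping query name -> plan name for matched queries.
    """
    # feasible[i] lists the plan indices that satisfy query i's SLO.
    feasible = [[j for j, p in enumerate(plans) if meets_slo(q, p)]
                for q in queries]
    plan_owner = [None] * len(plans)  # query index holding each plan

    def try_assign(qi, visited):
        for pj in feasible[qi]:
            if pj in visited:
                continue
            visited.add(pj)
            # Take a free plan, or move its current owner to another plan.
            if plan_owner[pj] is None or try_assign(plan_owner[pj], visited):
                plan_owner[pj] = qi
                return True
        return False

    for qi in range(len(queries)):
        try_assign(qi, set())

    return {queries[qi].name: plans[pj].name
            for pj, qi in enumerate(plan_owner) if qi is not None}


# Illustrative inputs only; none of these values come from the paper.
queries = [
    Query("q_loose", max_latency_ms=500, min_accuracy=0.70, max_cost=5),
    Query("q_strict", max_latency_ms=100, min_accuracy=0.90, max_cost=5),
]
plans = [
    Plan("edge_small", latency_ms=80, accuracy=0.92, cost=4),
    Plan("cloud_large", latency_ms=300, accuracy=0.95, cost=3),
]
assignment = match_queries_to_plans(queries, plans)
```

In this toy input, the loose query initially takes the edge plan and is then moved to the cloud plan so that the strict query, which only the edge plan can satisfy, is also served. A greedy first-fit assignment would have left the strict query unmatched, which is the reason to use a matching formulation under contention.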
