The rise of compound AI serving, which chains multiple operators into a pipeline, enables end-user applications such as generative AI-powered meeting companions, autonomous driving, and immersive gaming. These workloads span diverse deployment spaces, from cloud-only queries to edge-assisted ones across infrastructure tiers, and a single application often includes both. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and cost -- requires jointly planning operator placement, configuration, and resource allocation. However, diverse SLOs, varying runtime environments (e.g., heterogeneous device speeds), and a large volume of queries competing for shared infrastructure cause the planning space to explode, making real-time serving and cost-efficient deployment intractable for existing approaches. This paper presents Compass, the first SLO-aware query planner that optimizes large-scale compound AI workloads across diverse deployment spaces. Compass decomposes the many-query, multi-SLO planning problem into tractable subproblems while preserving global decision quality, exploiting plan similarities within and across queries to cut the number of search steps. It further improves per-step efficiency with a plan profiler that profiles selectively, achieving high-fidelity performance estimates at a fraction of the profiling cost. At runtime, Compass performs query-plan bipartite matching to maximize SLO goodput under resource contention. Real-world evaluations show that Compass improves service goodput by 2.4--5.1x, reduces deployment costs by 3.8--4.5x, and accelerates planning by 4.2--10.5x, achieving service responsiveness within seconds and near-optimal decision quality.
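To make the runtime step concrete, query-plan matching can be viewed as a maximum bipartite matching problem: queries on one side, available plan slots on the other, and an edge wherever a plan would meet a query's SLO. The following is a minimal sketch under that assumption, not Compass's actual algorithm. The query ids, plan slot names, and the `feasible` map are hypothetical, and the matcher is the textbook augmenting-path method.

```python
from typing import Dict, List, Set


def match_queries_to_plans(feasible: Dict[str, List[str]]) -> Dict[str, str]:
    """Maximum bipartite matching via augmenting paths.

    `feasible` maps each query id to the plan slots assumed to meet its SLO.
    Returns a query -> plan assignment that serves as many queries as possible.
    """
    plan_owner: Dict[str, str] = {}  # plan slot -> query currently holding it

    def try_assign(query: str, visited: Set[str]) -> bool:
        for plan in feasible.get(query, []):
            if plan in visited:
                continue
            visited.add(plan)
            # Take a free plan, or displace its owner if the owner can move elsewhere.
            if plan not in plan_owner or try_assign(plan_owner[plan], visited):
                plan_owner[plan] = query
                return True
        return False

    for query in feasible:
        try_assign(query, set())

    return {query: plan for plan, query in plan_owner.items()}


# Hypothetical example: q2 can only run on the edge slot, so q1 is moved to the cloud.
example = {
    "q1": ["edge", "cloud_a"],
    "q2": ["edge"],
    "q3": ["cloud_a", "cloud_b"],
}
assignment = match_queries_to_plans(example)
print(assignment)
```

In this toy setting every matched query counts equally toward goodput. A weighted variant (for example, one that prefers cheaper plans) would need an assignment solver rather than plain cardinality matching.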