Text-to-image diffusion models have achieved remarkable visual quality but incur high computational costs, making latency-aware, scalable deployment challenging. To address this, we advocate a hybrid architecture that achieves query awareness when serving diffusion models. Unlike existing query-aware serving systems that cascade lightweight and heavyweight models with a fixed configuration, our hybrid architecture first routes each query directly to a suitable model variant, then reroutes it to a cascaded heavyweight model only if necessary. We theoretically analyze the conditions under which the hybrid architecture outperforms non-hybrid alternatives in latency and response quality. Building on this architecture, we design HADIS, a hybrid serving system for latency-aware diffusion models that jointly optimizes cascade model selection, query routing, and resource allocation. To reduce the complexity of resource management, HADIS uses an offline profiling phase to produce a table of Pareto-optimal cascade configurations. At runtime, HADIS selects the best cascade configuration and GPU allocation given latency and workload constraints. Empirical evaluations on real-world traces demonstrate that HADIS improves response quality by up to 35% while reducing latency violation rates by 2.7$\times$ to 45$\times$ compared to state-of-the-art model serving systems.
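The two-phase idea described above (offline profiling into a Pareto-optimal table, then runtime selection under a latency constraint) can be illustrated with a minimal sketch. This is not the HADIS implementation: the `CascadeConfig` schema, the function names, and every latency and quality number below are hypothetical, chosen only for illustration, and GPU allocation and query routing are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CascadeConfig:
    # Hypothetical schema; the abstract does not specify what the table stores.
    name: str
    latency_s: float  # profiled per-query latency in seconds (illustrative)
    quality: float    # profiled response-quality score, higher is better (illustrative)


def pareto_front(configs: List[CascadeConfig]) -> List[CascadeConfig]:
    """Offline step: drop every config that some other config dominates.

    A config is dominated if another one is no slower and no worse in
    quality, and strictly better in at least one of the two.
    """
    front = []
    for c in configs:
        dominated = any(
            o.latency_s <= c.latency_s
            and o.quality >= c.quality
            and (o.latency_s < c.latency_s or o.quality > c.quality)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_s)


def select_config(
    table: List[CascadeConfig], latency_budget_s: float
) -> Optional[CascadeConfig]:
    """Runtime step: highest-quality config that fits the latency budget."""
    feasible = [c for c in table if c.latency_s <= latency_budget_s]
    return max(feasible, key=lambda c: c.quality) if feasible else None


# Made-up profiling results, for illustration only.
profiled = [
    CascadeConfig("light-only", 1.0, 0.60),
    CascadeConfig("light+heavy", 2.5, 0.80),
    CascadeConfig("heavy-only", 4.0, 0.85),
    CascadeConfig("wasteful", 3.0, 0.70),  # dominated by "light+heavy"
]
table = pareto_front(profiled)
chosen = select_config(table, latency_budget_s=3.0)
```

With these made-up numbers, the dominated entry is pruned offline, and a 3-second budget selects `light+heavy`; a budget below 1 second yields `None`, which a real system would have to handle, for example by shedding load or relaxing the constraint.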