Diffusion Transformer (DiT) models excel at generating high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging because of their high computational cost, particularly at large resolutions. Existing serving systems use fixed-degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism, which dynamically adjusts the degree of parallelism of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment by (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level and minimizing GPU-hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment than existing solutions without degrading image quality.
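The idea of choosing a per-request degree of parallelism within a scheduling round can be illustrated with a toy sketch. This is not TetriServe's actual algorithm: the `Request` fields, the per-degree `step_time` profile, the candidate degrees, and the greedy earliest-deadline-first packing are all assumptions made for illustration. The sketch picks, for each request, the degree that meets its deadline at the lowest GPU-time cost, then packs requests onto the available GPUs for one round.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    steps_left: int    # denoising steps still to run
    deadline: float    # seconds remaining until the SLO deadline
    step_time: dict    # hypothetical profile: degree -> seconds per step


def cheapest_degree(req, degrees=(1, 2, 4, 8)):
    """Return the degree that meets the deadline with the fewest GPU-seconds
    per step, or None if no candidate degree can meet the deadline."""
    feasible = [d for d in degrees
                if req.steps_left * req.step_time[d] <= req.deadline]
    if not feasible:
        return None
    # GPU-seconds per step = number of GPUs * seconds per step at that degree.
    return min(feasible, key=lambda d: d * req.step_time[d])


def plan_round(requests, num_gpus, degrees=(1, 2, 4, 8)):
    """Greedily assign degrees for one round, earliest deadline first.
    Requests that do not fit (or cannot meet their deadline) are left out
    of the plan and would be reconsidered in a later round."""
    plan = {}
    free = num_gpus
    for req in sorted(requests, key=lambda r: r.deadline):
        d = cheapest_degree(req, degrees)
        if d is not None and d <= free:
            plan[req.name] = d
            free -= d
    return plan


if __name__ == "__main__":
    # Made-up latency profiles: the large request scales better with more GPUs.
    small = Request("small", 20, 10.0, {1: 0.4, 2: 0.25, 4: 0.15, 8: 0.1})
    large = Request("large", 20, 13.0, {1: 2.0, 2: 1.1, 4: 0.6, 8: 0.35})
    print(plan_round([large, small], num_gpus=8))
```

Because the plan is recomputed every round from the current `steps_left` and `deadline`, a request's degree can change between steps, which is the sense in which parallelism here is step-level rather than fixed per request.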