Diffusion Transformer (DiT) models excel at generating high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging due to their high computational cost, particularly at large resolutions. Existing serving systems use fixed-degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism, which dynamically adjusts the parallel degree of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment by (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level to minimize GPU-hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment than existing solutions without degrading image quality.
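To make the idea of deadline-driven, step-level parallelism concrete, the following is a minimal sketch and not TetriServe's actual scheduler. It assumes a hypothetical per-step latency profile (`step_latency`, mapping parallel degree to seconds per denoising step) and a hypothetical helper `pick_degree` that, for one request, selects the degree that meets the remaining deadline with the fewest GPU-seconds. A round-based scheduler could re-run such a choice at every round boundary, so a request's degree may change between steps.

```python
from typing import Dict, Optional


def pick_degree(steps_left: int,
                time_left: float,
                step_latency: Dict[int, float]) -> Optional[int]:
    """Return the parallel degree that finishes the remaining denoising
    steps within time_left at the lowest GPU-seconds, or None if no
    degree can meet the deadline.

    step_latency is a hypothetical profile: degree -> seconds per step.
    """
    best_degree: Optional[int] = None
    best_cost = float("inf")
    for degree, latency in sorted(step_latency.items()):
        total_time = steps_left * latency
        if total_time > time_left:
            continue  # this degree would miss the deadline
        gpu_seconds = degree * total_time
        if gpu_seconds < best_cost:
            best_degree, best_cost = degree, gpu_seconds
    return best_degree


# Illustrative numbers only (not measurements): scaling is sublinear,
# so higher degrees finish sooner but consume more GPU-seconds.
profile = {1: 0.40, 2: 0.22, 4: 0.13, 8: 0.09}

relaxed = pick_degree(steps_left=20, time_left=10.0, step_latency=profile)
tight = pick_degree(steps_left=20, time_left=2.0, step_latency=profile)
impossible = pick_degree(steps_left=20, time_left=1.0, step_latency=profile)
```

Under this assumed profile, a relaxed deadline is served on a single GPU, while a tight one forces the highest degree. Packing several requests onto a shared GPU pool within a round, as the abstract describes, is a separate problem that this sketch does not address.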