TetriServe: Efficient DiT Serving for Heterogeneous Image Generation

Diffusion Transformer (DiT) models excel at generating high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging because of their high computational cost, particularly at large resolutions. Existing serving systems use fixed-degree sequence parallelism. This is inefficient for heterogeneous workloads with mixed resolutions and deadlines, and it leads to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism, which dynamically adjusts the parallel degree of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment by (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level to minimize GPU-hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment than existing solutions without degrading image quality.
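To make the round-based idea concrete, here is a minimal Python sketch of the decision made in a single scheduling round. It rests on assumptions that are ours, not the paper's. `STEP_LATENCY` is a hypothetical profiled table of per-step latency for each resolution and sequence-parallel degree, with numbers invented for illustration. Degrees are restricted to 1, 2, 4 and 8. A simple greedy earliest-deadline-first packing stands in for TetriServe's actual scheduler. For each request, the sketch picks the degree that still meets the deadline with the fewest GPU-seconds. Requests that can no longer meet their deadline at any degree only receive leftover GPUs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical per-step latency (seconds) for (resolution, parallel degree).
# Illustrative numbers only; NOT measurements from the paper.
STEP_LATENCY: Dict[Tuple[int, int], float] = {
    (1024, 1): 0.40, (1024, 2): 0.22, (1024, 4): 0.13, (1024, 8): 0.09,
    (2048, 1): 1.60, (2048, 2): 0.85, (2048, 4): 0.45, (2048, 8): 0.26,
}
DEGREES = (1, 2, 4, 8)  # assumed set of allowed sequence-parallel degrees


@dataclass
class Request:
    rid: str
    resolution: int       # image side length in pixels
    remaining_steps: int  # denoising steps still to run
    deadline: float       # absolute deadline in seconds


def feasible_degrees(req: Request, now: float) -> List[int]:
    """Degrees that finish all remaining steps by the deadline, cheapest GPU-seconds first."""
    slack = req.deadline - now
    options = []
    for d in DEGREES:
        finish = req.remaining_steps * STEP_LATENCY[(req.resolution, d)]
        if finish <= slack:
            options.append((d * finish, d))  # (GPU-seconds consumed, degree)
    return [d for _, d in sorted(options)]


def schedule_round(reqs: List[Request], now: float, num_gpus: int) -> Dict[str, int]:
    """Assign a degree to each request for one round (greedy EDF packing, an assumption)."""
    plan: Dict[str, int] = {}
    free = num_gpus
    late: List[Request] = []
    for req in sorted(reqs, key=lambda r: r.deadline):
        degrees = feasible_degrees(req, now)
        fitting = [d for d in degrees if d <= free]
        if fitting:
            plan[req.rid] = fitting[0]  # cheapest degree that still fits this round
            free -= fitting[0]
        elif not degrees:
            late.append(req)  # cannot meet its deadline at any degree
        # else: feasible, but no GPUs left this round, so it waits for the next one
    for req in late:
        if free >= 1:
            plan[req.rid] = 1  # best effort on leftover GPUs
            free -= 1
    return plan


if __name__ == "__main__":
    reqs = [
        Request("a", 1024, 20, deadline=5.0),
        Request("b", 2048, 20, deadline=10.0),
        Request("c", 1024, 20, deadline=30.0),
        Request("d", 2048, 50, deadline=3.0),  # already hopeless
    ]
    print(schedule_round(reqs, now=0.0, num_gpus=8))
    # prints {'a': 2, 'b': 4, 'c': 1, 'd': 1}
```

In this example the tight-deadline requests receive just enough parallelism to finish in time, and the loose-deadline request runs on a single GPU. Re-running the decision every round as steps complete is what would let a request's degree change at step granularity. The greedy ordering is only one plausible packing policy.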
