Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01$\times$, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 $\mu$s.