Autoregressive diffusion transformers (AR-DiTs) recast video generation from an offline paradigm to a real-time streaming one: the model generates video one chunk at a time, making each chunk available for playout once produced. The service-level objective (SLO) for this paradigm is no longer fixed latency or throughput but the preservation of playout continuity: generation must stay ahead of the playout timeline. Once generation falls behind, the remaining playable buffer (playout slack) is exhausted, and users experience visible stalls. This objective reveals two serving design insights. First, real-time video generation has a dynamic SLO that evolves with playout progress, so resources should move toward streams with lower playout slack. Second, an acceptable chunk delivered on time is preferable to a late high-fidelity chunk, so per-chunk fidelity configurations should adapt to available playout slack. Guided by these insights, we present SlackServe, a playout-slack-driven serving system that preserves playout continuity in real-time streaming video generation. SlackServe uses playout slack as a unified signal, reallocating resources across streams through three-tier priority queues, re-homing, and elastic sequence parallelism, while selecting per-chunk fidelity configurations within each stream through Bi-Modal Pareto Routing under a quality floor. On a 16-H100 GPU cluster, SlackServe improves Quality of Experience (QoE), measured by Continuous Play Ratio (CPR), by 1.64x-3.29x and reduces Time to First Chunk (TTFC) by 1.61x-9.65x over baselines, while preserving comparable generation quality.
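To make the notion of playout slack concrete, the following is a minimal sketch, not SlackServe's actual implementation. It assumes slack is the media time already generated minus the media time already played, and it ignores the pause in playout that a real stall would cause. The `Stream` and `FidelityConfig` classes, the tier thresholds, the quality scores, and the latency figures are all hypothetical names and values chosen for illustration. The `pick_config` rule is a simple stand-in for slack-aware fidelity selection under a quality floor, not the Bi-Modal Pareto Routing algorithm itself.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str
    chunks_ready: int          # chunks generated so far
    chunk_seconds: float       # media duration of one chunk
    playout_started_at: float  # wall-clock time at which playout began

    def slack(self, now: float) -> float:
        """Seconds of generated-but-unplayed video; <= 0 means a stall."""
        generated = self.chunks_ready * self.chunk_seconds
        played = max(0.0, now - self.playout_started_at)
        return generated - played


@dataclass(frozen=True)
class FidelityConfig:
    name: str
    quality: float  # hypothetical quality score, higher is better
    latency: float  # hypothetical seconds needed to generate one chunk


def tier(slack: float, urgent: float = 1.0, relaxed: float = 4.0) -> int:
    """Map slack to one of three priority tiers (0 is most urgent).

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if slack < urgent:
        return 0
    if slack < relaxed:
        return 1
    return 2


def schedule(streams: list[Stream], now: float) -> list[Stream]:
    """Order streams so that those closest to stalling are served first."""
    return sorted(streams, key=lambda s: (tier(s.slack(now)), s.slack(now)))


def pick_config(slack: float, configs: list[FidelityConfig],
                quality_floor: float) -> FidelityConfig:
    """Pick the best config that still finishes within the available slack.

    Configs below the quality floor are never used. If nothing can finish
    on time, fall back to the fastest config that meets the floor.
    """
    allowed = [c for c in configs if c.quality >= quality_floor]
    if not allowed:
        raise ValueError("no configuration meets the quality floor")
    on_time = [c for c in allowed if c.latency <= slack]
    if on_time:
        return max(on_time, key=lambda c: c.quality)
    return min(allowed, key=lambda c: c.latency)
```

Under these assumptions, a stream with little buffered video is scheduled ahead of one with a deep buffer, and it also receives a cheaper configuration for its next chunk, which mirrors the two design insights stated in the abstract.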