Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that narrows the gap with continuous approaches for scalable video generation. At its core, URSA formulates video generation as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies diverse tasks, including interpolation and image-to-video generation, within a single model. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA.
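As a rough illustration of the resolution-dependent timestep shifting idea, the sketch below skews a uniform timestep grid toward higher noise levels as the token count grows. This is only an assumption-laden sketch: the abstract does not specify URSA's schedule, and the formula follows the shift parameterization commonly used in rectified-flow models; the names `shifted_timesteps`, `num_tokens`, `base_tokens`, and `base_shift` are hypothetical.

```python
import numpy as np

def shifted_timesteps(num_steps: int, num_tokens: int,
                      base_tokens: int = 256, base_shift: float = 1.0) -> np.ndarray:
    """Hypothetical sketch of resolution-dependent timestep shifting.

    Uses the common shift map t' = s*t / (1 + (s - 1)*t); URSA's actual
    schedule may differ. Larger token counts (higher resolution or longer
    clips) give s > 1, which concentrates sampling steps at high noise.
    """
    # Assumption: scale the shift with the square root of the token-count ratio.
    s = base_shift * np.sqrt(num_tokens / base_tokens)
    # Uniform grid from t = 1 (pure noise) down to t = 0 (clean data).
    t = np.linspace(1.0, 0.0, num_steps + 1)
    # Shifted grid: for s > 1, intermediate points are pushed toward t = 1,
    # so more of the step budget is spent at high noise levels.
    return s * t / (1.0 + (s - 1.0) * t)

# Example: a larger token count yields a schedule biased toward high noise.
print(shifted_timesteps(8, num_tokens=1024))
```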