Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, which enables sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control; the resulting model generates high-quality videos that adhere to the global text prompt and local motion guidance, but cannot run inference on the fly. We therefore distill this bidirectional teacher into a causal student via Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Generating videos over long, potentially infinite time horizons raises several key challenges: (1) bridging the domain gap between training on finite-length videos and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference without the computational cost growing as the context window expands. Key to our approach is a carefully designed sliding-window causal attention combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we faithfully simulate inference-time extrapolation with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real time, delivering a truly interactive experience.
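The constant-cost mechanism described above can be illustrated with a minimal sketch: a fixed number of "sink" entries from the start of the sequence are kept permanently, while the remaining KV cache rolls over a fixed-size window, so the attention context never grows. The class and parameter names below (`RollingKVCache`, `num_sinks`, `window`) are hypothetical illustrations, not the paper's actual implementation.

```python
from collections import deque


class RollingKVCache:
    """Hypothetical sketch of a fixed-size KV cache with attention sinks.

    The first `num_sinks` entries are retained forever (attention sinks);
    everything after that lives in a sliding window of size `window`,
    so total context size is bounded by num_sinks + window.
    """

    def __init__(self, num_sinks: int, window: int):
        self.num_sinks = num_sinks
        self.sinks = []                      # permanently kept earliest entries
        self.recent = deque(maxlen=window)   # rolling window; old entries evicted

    def append(self, kv):
        """Insert one new KV entry (here just a token index for illustration)."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)            # earliest tokens become sinks
        else:
            self.recent.append(kv)           # deque drops the oldest automatically

    def context(self):
        """The bounded set of entries attention would attend to."""
        return self.sinks + list(self.recent)
```

Because `context()` has a fixed maximum length regardless of how many entries have been appended, per-step attention cost stays constant, which is what permits arbitrarily long generation at a steady frame rate.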