Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that generates video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module that incorporates short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows that this simple and effective method significantly surpasses alternative causal strategies, including distillation and generative encoders. Extensive experiments show that DyStream generates video within 34 ms per frame, guaranteeing that the entire system latency remains under 100 ms. It also achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model weights and code are available.
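To make the lookahead idea concrete, the sketch below builds an attention mask in which each step attends to all past steps plus a small fixed number of future steps. This is an illustrative assumption, not the paper's implementation: the function name, the `lookahead` parameter, and the 20 ms feature hop (so 3 steps ≈ 60 ms) are hypothetical choices for the example.

```python
import numpy as np

def lookahead_causal_mask(num_steps: int, lookahead: int) -> np.ndarray:
    """Boolean attention mask: step i may attend to steps j <= i + lookahead.

    lookahead=0 recovers a strictly causal (lower-triangular) mask;
    lookahead=k adds k future steps of context, costing k hops of latency.
    """
    i = np.arange(num_steps)[:, None]  # query positions
    j = np.arange(num_steps)[None, :]  # key positions
    return j <= i + lookahead

# Assuming a 20 ms feature hop, 3 steps of lookahead ≈ 60 ms of future context.
mask = lookahead_causal_mask(6, 3)
```

The added latency is simply `lookahead * hop`: each step must buffer that many future feature frames before it can be encoded, which is how a short lookahead trades a bounded delay for better context.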