Interactive humanoid video generation aims to synthesize lifelike visual agents that engage with humans through continuous, responsive video. Despite recent advances in video synthesis, existing methods typically face a trade-off between high-fidelity synthesis and real-time interaction. In this paper, we propose FlowAct-R1, a framework designed specifically for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables streaming synthesis of videos of arbitrary duration while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to mitigate error accumulation and preserve long-term temporal consistency during continuous interaction. Through efficient distillation and system-level optimizations, our framework achieves a stable 25 fps at 480p resolution with a time-to-first-frame (TTFF) of only about 1.5 seconds. The proposed method provides holistic, fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism while generalizing robustly across diverse character styles.
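To make the chunkwise diffusion forcing idea concrete, the following is a minimal, hypothetical PyTorch sketch of one training step under a flow-matching-style linear noise schedule: each chunk of latent frames receives its own independently sampled noise level, which is what later allows earlier chunks to be cleaner than later ones during streaming inference. The model interface (per-frame timesteps), the noise schedule, and all identifiers such as `chunkwise_diffusion_forcing_step` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a chunkwise diffusion forcing training step.
# All names and the model interface are hypothetical, not from FlowAct-R1.
import torch
import torch.nn.functional as F

def chunkwise_diffusion_forcing_step(model, video_latents, chunk_size,
                                     num_noise_levels=1000):
    """One training step where every chunk of latent frames gets its own
    noise level, so the model learns to denoise under mixed corruption
    patterns (the regime encountered during streaming generation)."""
    B, T, C, H, W = video_latents.shape
    assert T % chunk_size == 0, "sequence length must be a multiple of chunk_size"
    num_chunks = T // chunk_size

    # Sample an independent noise level per chunk, then broadcast it to
    # every frame inside that chunk.
    t_chunk = torch.randint(0, num_noise_levels, (B, num_chunks),
                            device=video_latents.device)
    t_frame = t_chunk.repeat_interleave(chunk_size, dim=1)  # (B, T)

    # Flow-matching-style linear interpolation between clean latents and noise.
    noise = torch.randn_like(video_latents)
    alpha = (t_frame.float() / num_noise_levels).view(B, T, 1, 1, 1)
    noisy = (1.0 - alpha) * video_latents + alpha * noise

    # Assumed interface: the model accepts per-frame noise levels and
    # predicts the velocity field (noise minus clean latents).
    pred = model(noisy, t_frame)
    target = noise - video_latents
    return F.mse_loss(pred, target)
```

Under the self-forcing variant described in the abstract, the clean latents of earlier chunks would be replaced by the model's own streamed generations during training, exposing the model to the same error accumulation it faces at inference time.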