ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

Speech-driven talking character animation seeks to generate lifelike portrait videos that convey natural conversational behavior by aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved the realism of video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate but rigid or expressive but poorly synchronized. We address this challenge with ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds on a pretrained video generation model to capture both fine-grained speech articulation and high-level expressive cues. The model introduces a multi-level speech representation that captures phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training; it discourages perceptually implausible motion without relying on handcrafted synchronization metrics, reward models, or costly human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.
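
As a rough illustration of what a learnable level selector could look like, the sketch below mixes several levels of speech features with softmax weights derived from per-block logits. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: the names `softmax` and `select_levels` are hypothetical, the softmax-weighted sum is one plausible choice rather than the paper's design, and the feature values are toy numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_levels(level_feats, selector_logits):
    """Mix multi-level speech features for one transformer block.

    level_feats: list of levels, each a [frames][dim] nested list
                 (e.g. local phonetic, global prosodic). Hypothetical layout.
    selector_logits: one learnable logit per level for this block.
    Returns the mixed [frames][dim] features and the level weights.
    """
    weights = softmax(selector_logits)
    num_frames = len(level_feats[0])
    dim = len(level_feats[0][0])
    mixed = [
        [
            sum(w * level_feats[lvl][t][d] for lvl, w in enumerate(weights))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]
    return mixed, weights

# Toy input: two levels, two frames, two feature dimensions.
local_feats = [[1.0, 0.0], [0.0, 1.0]]
global_feats = [[3.0, 2.0], [2.0, 3.0]]

# Zero logits give uniform weights, so the output is the plain average.
mixed, weights = select_levels([local_feats, global_feats], [0.0, 0.0])
```

In a trained model the logits would be parameters updated by gradient descent, so each block could learn to favor a different level; this sketch only shows the forward mixing step.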
