In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing a distinct speech attribute, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing ground-truth utterances in naturalness.
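For context, the objective below is the standard rectified flow matching formulation; the notation is ours, and the conditioning variable $c$ and exact parameterization are assumptions rather than details taken from the paper. With noise $x_0 \sim \mathcal{N}(0, I)$, target speech features $x_1$, and the linear interpolant $x_t = (1 - t)\,x_0 + t\,x_1$, the decoder's velocity field $v_\theta$ is trained to follow the straight-line path from noise to data:

$$\mathcal{L}_{\mathrm{RFM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0,\; x_1}\big[\,\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \rVert^2\,\big],$$

where $c$ would collect the predicted content, pitch, and speaker attributes. At inference time, speech is generated by integrating the ODE $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$; the approximately straight transport paths are what allow efficient sampling with few integration steps.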