In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing a distinct speech attribute, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing ground-truth utterances in naturalness.
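For context, the objective below is the standard rectified flow matching formulation; the notation is ours, and the conditioning variable $c$ and exact parameterization are assumptions rather than details taken from the paper. With noise $x_0 \sim \mathcal{N}(0, I)$, target speech features $x_1$, and the linear interpolant $x_t = (1 - t)\,x_0 + t\,x_1$, the decoder's velocity field $v_\theta$ is trained to follow the straight-line path from noise to data:

$$\mathcal{L}_{\mathrm{RFM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0,\; x_1}\big[\,\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \rVert^2\,\big],$$

where $c$ would collect the predicted content, pitch, and speaker attributes. At inference time, speech is generated by integrating the ODE $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$; the approximately straight transport paths are what allow efficient sampling with few integration steps.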