Large-scale unsupervised audio pre-training for video-to-speech synthesis

From arXiv. Corrected typos. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process: an intermediate representation, such as a spectrogram, is first extracted from the video and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, in which the raw audio and any intermediate representations are generated jointly. All such approaches train almost exclusively on audio-visual datasets, i.e. every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets that may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets), as well as audio-only architectures that the audio machine learning community has developed over the years. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this pre-training step improves the reconstructed speech and that it is a previously unexplored way to improve the quality of the generator in a cross-modal task while requiring samples from only one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.
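The two-stage recipe described in the abstract (audio-only pre-training of an encoder-decoder, then reuse of the decoder for video-to-speech) can be illustrated with a minimal PyTorch sketch. This is not the authors' architecture: the module names (`AudioEncoder`, `AudioDecoder`, `VideoEncoder`), the layer choices, all dimensions, and the assumption that video features and mel frames are already aligned in time are hypothetical placeholders, and random tensors stand in for real data. Only the mel-spectrogram target is shown.

```python
import torch
from torch import nn

# All sizes are illustrative assumptions, not values from the paper.
LATENT_DIM = 64    # hypothetical size of the latent sequence shared by both stages
N_MELS = 80        # hypothetical number of mel bins
VIDEO_FEAT = 512   # hypothetical per-frame visual feature size


class AudioEncoder(nn.Module):
    """Maps mel frames (batch, time, N_MELS) to latents (batch, time, LATENT_DIM)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_MELS, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM)
        )

    def forward(self, mel):
        return self.net(mel)


class AudioDecoder(nn.Module):
    """Maps latents (batch, time, LATENT_DIM) back to mel frames."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, N_MELS)
        )

    def forward(self, latents):
        return self.net(latents)


class VideoEncoder(nn.Module):
    """Placeholder visual encoder: per-frame features to latents."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(VIDEO_FEAT, LATENT_DIM, batch_first=True)

    def forward(self, feats):
        out, _ = self.rnn(feats)
        return out


def pretrain_step(encoder, decoder, mel, optimizer):
    """One unsupervised reconstruction step that uses audio only (no labels)."""
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(decoder(encoder(mel)), mel)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)

# Stage 1: audio-only pre-training of the encoder-decoder pair.
audio_enc, audio_dec = AudioEncoder(), AudioDecoder()
optimizer = torch.optim.Adam(
    list(audio_enc.parameters()) + list(audio_dec.parameters()), lr=1e-3
)
audio_only_batch = torch.randn(4, 50, N_MELS)  # random stand-in for real mel frames
pretrain_loss = pretrain_step(audio_enc, audio_dec, audio_only_batch, optimizer)

# Stage 2: initialize the video-to-speech decoder from the pre-trained decoder.
v2s_encoder = VideoEncoder()
v2s_decoder = AudioDecoder()
v2s_decoder.load_state_dict(audio_dec.state_dict())

video_feats = torch.randn(4, 50, VIDEO_FEAT)  # random stand-in for visual features
predicted_mel = v2s_decoder(v2s_encoder(video_feats))
```

In this sketch the audio encoder is discarded after the first stage; only the decoder weights carry over, and the video encoder plus the initialized decoder would then be trained on paired audio-visual data.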
