TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning, an approach that circumvents the delays and error propagation associated with cascaded models. However, talking head translation, which converts audio-visual speech (i.e., talking head video) from one language into another, still faces several challenges compared to audio-only speech translation: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, which results in delays and cascading errors. (2) Talking head translation has a limited set of reference frames. If the generated translation exceeds the length of the original speech, the video sequence must be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose a model for talking head translation, TransFace, which directly translates audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model that converts audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, that re-synthesizes synchronized audio-visual speech from discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor, which ensures isometric talking head translation and prevents duplicate reference frames. Experiments demonstrate that our proposed Unit2Lip model significantly improves synchronization (1.601 and 0.982 on LSE-C for the original and generated audio speech, respectively) and boosts inference speed by a factor of 4.35 on LRS2. Additionally, TransFace achieves impressive BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T and 100% isochronous translations.
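The isometric constraint described above can be illustrated with a small sketch. The abstract does not specify how the Bounded Duration Predictor works internally, so the following is only one plausible reading, not the authors' method: it assumes a duration model has already produced a real-valued duration per discrete unit, and it rescales those durations so that their sum equals a fixed frame budget (the length of the source video). The function name `bound_durations` and its parameters are hypothetical.

```python
import math
from typing import List


def bound_durations(predicted: List[float], target_total: int) -> List[int]:
    """Rescale per-unit durations so they sum to exactly `target_total` frames.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not predicted:
        raise ValueError("predicted must contain at least one duration")
    if target_total < 0:
        raise ValueError("target_total must be non-negative")

    # Assumption: negative predictions are treated as zero-length units.
    clipped = [max(d, 0.0) for d in predicted]
    total = sum(clipped)

    if total > 0:
        # Proportional rescaling keeps the relative length of each unit.
        scaled = [d * target_total / total for d in clipped]
    else:
        # Degenerate case: spread the budget uniformly.
        scaled = [target_total / len(clipped)] * len(clipped)

    # Round down first, then hand the leftover frames to the units
    # with the largest fractional parts (largest-remainder rounding).
    frames = [math.floor(s) for s in scaled]
    leftover = target_total - sum(frames)
    by_fraction = sorted(
        range(len(scaled)),
        key=lambda i: scaled[i] - frames[i],
        reverse=True,
    )
    for i in by_fraction[:leftover]:
        frames[i] += 1
    return frames


if __name__ == "__main__":
    # Three units predicted at 2, 3 and 5 frames, stretched to a 20-frame video.
    print(bound_durations([2.0, 3.0, 5.0], 20))
```

Because the output always sums to the frame budget, the synthesized speech in this sketch can be neither longer nor shorter than the reference video, so no reference frame needs to be repeated. A learned predictor could enforce the same bound in other ways, for example by normalizing inside the network.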
