Zero-shot text-to-speech (TTS) has gained significant attention for its powerful voice cloning capabilities, requiring only a few seconds of speech from an unseen speaker as a prompt. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in inference speed, model size, and robustness. We therefore propose MobileSpeech, the first fast, lightweight, and robust zero-shot TTS system designed for mobile devices. Specifically: 1) Leveraging a discrete speech codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weighting mechanisms across different codec layers during generation. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt duration from the prompt speech and fuse the text and prompt speech through cross-attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in generation speed and speech quality. MobileSpeech achieves a real-time factor (RTF) of 0.09 on a single A100 GPU, and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at \url{https://mobilespeech.github.io/}.
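To make the reported speed figure concrete, the sketch below uses the standard definition of the real-time factor: wall-clock synthesis time divided by the duration of the generated audio. The function names and the 10-second clip are illustrative assumptions, not part of MobileSpeech; only the RTF value of 0.09 comes from the abstract.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = wall-clock synthesis time / duration of generated audio.

    Values below 1.0 mean synthesis runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time_for(rtf: float, audio_seconds: float) -> float:
    """Return the wall-clock time implied by a given RTF for a clip of this length."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return rtf * audio_seconds


if __name__ == "__main__":
    # Hypothetical clip length; 0.09 is the RTF reported on a single A100 GPU.
    clip_seconds = 10.0
    needed = synthesis_time_for(0.09, clip_seconds)
    print(f"A {clip_seconds:.0f} s clip at RTF 0.09 takes about {needed:.2f} s to synthesize")
```

Under this definition, an RTF of 0.09 means that a 10-second utterance would take roughly 0.9 seconds to generate on the stated hardware; the figure does not by itself describe on-device speed.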