Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and on resource-constrained hardware. Moreover, they lack a specific inductive bias with regard to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently, our architecture can be efficiently trained on long samples and achieves state-of-the-art zero-shot voice cloning against baselines of comparable size.