We present UTDUSS, the UTokyo-SaruLab system submitted to the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for several speech processing tasks. We submitted UTDUSS to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates a neural audio codec (NAC) pre-trained only on speech corpora, which allows the learned codec to represent the rich acoustic features necessary for high-fidelity speech reconstruction. For the Acoustic+Vocoder track, we trained an acoustic model based on a Transformer encoder-decoder that predicts the pre-trained NAC tokens from text input. We describe our strategies for building these models, including data selection, downsampling, and hyper-parameter tuning. Our system ranked second in the Vocoder track and first in the Acoustic+Vocoder track.