Discrete representations have shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, directly applying speech SSL models to singing generation runs into the domain gap between speech and singing. Furthermore, singing generation requires a more refined representation than typical speech does. To address these challenges, we introduce SingOMD, a novel method for extracting singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis.
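The pipeline described above — continuous SSL features, resampling into multiple temporal resolutions, and clustering into discrete tokens — can be illustrated with a minimal sketch. This is not the paper's implementation: the average-pooling resampler, the plain k-means routine, the feature dimensions, and the codebook size are all illustrative assumptions.

```python
import numpy as np

def resample(features: np.ndarray, factor: int) -> np.ndarray:
    """Downsample a (T, D) frame sequence by average-pooling
    non-overlapping windows of `factor` frames (illustrative resampler)."""
    T, D = features.shape
    T_trim = (T // factor) * factor
    return features[:T_trim].reshape(T_trim // factor, factor, D).mean(axis=1)

def kmeans_tokens(features: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Assign each frame a discrete token id with plain k-means
    (a stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = features[labels == j].mean(axis=0)
    return labels

# Toy stand-in for SSL hidden states: 200 frames of 16-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))

# Two resolution streams: the native frame rate and a 2x-downsampled one.
streams = {1: feats, 2: resample(feats, 2)}
tokens = {f: kmeans_tokens(s, k=8) for f, s in streams.items()}
print(tokens[1].shape, tokens[2].shape)  # one token id per frame at each resolution
```

The downstream singing vocoder or synthesis model would then consume these per-resolution token streams instead of the raw continuous features.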