Speech research has made significant strides in learning voice identity representations, but singing voices have not seen comparable progress. To bridge this gap, we propose a framework for training singer identity encoders that extract representations suitable for a range of singing-related tasks, such as singing voice similarity and synthesis. We explore several self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training so that the representations are invariant to pitch and content variations. We evaluate the resulting representations on singer similarity and identification tasks across multiple datasets, with particular emphasis on out-of-domain generalization. Operating at 44.1 kHz, our framework produces embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice. We release our code and trained models to facilitate further research on singing voice and related areas.
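As an illustration of how identity embeddings of this kind are commonly used for singer similarity, the minimal sketch below ranks reference singers by cosine similarity to a query embedding. It is an assumption for illustration only: the abstract does not state which similarity measure is used, the function names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 4-dimensional vectors are toy values rather than outputs of the released encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, references):
    """Return (label, similarity) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query, embedding))
        for label, embedding in references.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for encoder outputs (hypothetical values).
references = {
    "singer_a": [0.9, 0.1, 0.0, 0.2],
    "singer_b": [0.0, 0.8, 0.6, 0.1],
}
query = [0.8, 0.2, 0.1, 0.3]

ranking = rank_by_similarity(query, references)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

In practice the toy vectors would be replaced by embeddings computed from isolated vocal tracks, and an identification task could assign the query to the top-ranked singer.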