Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. However, they pose a threat to the music industry because of personality rights concerns. Methods to identify the original singer behind a synthetic voice are therefore needed. In this paper, we investigate how singer identification methods could be used for this task. We present three embedding models trained with a singer-level contrastive learning scheme, in which positive pairs consist of segments containing vocals from the same singer. These segments are mixtures for the first model, vocals for the second, and both for the third. We demonstrate that all three models are highly capable of identifying real singers. However, their performance deteriorates when classifying cloned versions of the singers in our evaluation set, especially for the models that take mixtures as input. These findings highlight the need to understand the biases that exist within singer identification systems and how they can influence the identification of voice deepfakes in music.
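To make the singer-level contrastive scheme concrete, here is a minimal sketch of a loss over positive pairs. The abstract does not state which loss is used, so the InfoNCE-style formulation, the temperature value, and the names `l2_normalize` and `singer_contrastive_loss` are assumptions made for illustration only. The sketch also assumes each singer appears once per batch, so that every other row can be treated as a negative.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def singer_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed choice, not confirmed by the abstract).

    Row i of `anchors` and row i of `positives` are embeddings of two
    segments sung by the same singer. All other rows in the batch are
    treated as negatives, which assumes one singer per batch row.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                        # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (the same singer).
    return float(-np.mean(np.diag(log_probs)))
```

In this sketch the embeddings would come from one of the three models, fed with mixture segments, vocal segments, or both; the loss is small when segments from the same singer are embedded close together and segments from different singers are not.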