The goal of this paper is to learn robust speaker representations for bilingual speaking scenarios. The majority of the world's population speaks at least two languages; however, most speaker recognition systems fail to recognise the same speaker when they speak in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. In this paper, we publish a large-scale evaluation set named VoxCeleb1-B, derived from VoxCeleb, that considers bilingual scenarios. We introduce an effective disentanglement learning strategy that combines adversarial and metric learning-based methods. This approach addresses the bilingual situation by disentangling language-related information from the speaker representation while ensuring stable speaker representation learning. Our language-disentangled learning method uses only language pseudo-labels and requires no manual language annotation.
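To make the combination of adversarial and metric learning-based terms concrete, the following is a minimal sketch and not the loss used in this paper. It assumes a hypothetical metric term, `cross_language_pull_loss`, that pulls together embeddings of the same speaker whose language pseudo-labels differ, and a hypothetical `encoder_objective` in which the language classification loss enters with a negative sign, which is the effect a gradient reversal layer has on the encoder. The weights `lam_metric` and `lam_adv` are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_language_pull_loss(embeddings, speakers, languages):
    """Hypothetical metric term: mean (1 - cosine) over all pairs that share
    a speaker label but carry different language pseudo-labels."""
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if speakers[i] == speakers[j] and languages[i] != languages[j]:
                total += 1.0 - cosine(embeddings[i], embeddings[j])
                count += 1
    return total / count if count else 0.0


def encoder_objective(speaker_loss, metric_loss, language_loss,
                      lam_metric=1.0, lam_adv=0.5):
    """Hypothetical encoder objective: minimise the speaker and metric terms
    while maximising the language classifier's loss (adversarial term)."""
    return speaker_loss + lam_metric * metric_loss - lam_adv * language_loss


# Toy batch: speaker "A" appears with two different language pseudo-labels.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speakers = ["A", "A", "B"]
languages = [0, 1, 0]

metric = cross_language_pull_loss(embeddings, speakers, languages)
objective = encoder_objective(speaker_loss=1.0, metric_loss=metric, language_loss=0.4)
```

In this toy batch the only qualifying pair is the two utterances of speaker "A", whose embeddings are orthogonal, so the metric term takes its value from that single pair. In a real system the three losses would be computed on neural network outputs and the language labels would come from a pseudo-labelling step rather than being supplied by hand.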