The goal of this paper is to learn robust speaker representations for bilingual speaking scenarios. The majority of the world's population speaks at least two languages; however, most speaker recognition systems fail to recognise the same speaker when they speak in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. In this paper, we publish VoxCeleb1-B, a large-scale evaluation set derived from VoxCeleb that covers bilingual scenarios. We introduce an effective disentanglement learning strategy that combines adversarial and metric learning-based methods. This approach addresses the bilingual situation by disentangling language-related information from the speaker representation while keeping speaker representation learning stable. Our language-disentangled learning method uses only language pseudo-labels and requires no manual annotation.