Multilingual speaker verification remains challenging because language-dependent acoustic variability entangles speaker identity with linguistic characteristics, which degrades generalization across languages. In multilingual training, embeddings often encode language cues alongside speaker identity, so that speakers form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling all speakers in an episode from a single language, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity. Experiments on the TidyVoice Challenge benchmark show consistent improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
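The language-consistent episode construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_index`, `sample_language_consistent_episode`, `prototypical_loss`), the `(utterance_id, speaker, language)` input layout, the N-way/K-shot episode parameters, and the choice of squared Euclidean distance to mean prototypes are all assumptions made for illustration, following the common prototypical-network formulation rather than any detail stated in the abstract.

```python
import math
import random
from collections import defaultdict


def build_index(utterances):
    """Group utterance ids as index[language][speaker] -> list of ids.

    `utterances` is an iterable of (utt_id, speaker, language) tuples
    (a hypothetical layout chosen for this sketch).
    """
    index = defaultdict(lambda: defaultdict(list))
    for utt_id, speaker, language in utterances:
        index[language][speaker].append(utt_id)
    return index


def sample_language_consistent_episode(index, n_way, k_shot, n_query, rng):
    """Draw one episode whose speakers all come from a single language."""
    per_speaker = k_shot + n_query
    # Keep speakers with enough utterances, then languages with enough speakers.
    eligible = {}
    for language, speakers in index.items():
        ok = [s for s, utts in speakers.items() if len(utts) >= per_speaker]
        if len(ok) >= n_way:
            eligible[language] = ok
    if not eligible:
        raise ValueError("no language has enough speakers for this episode size")
    # Assumption: languages are chosen uniformly; other schedules are possible.
    language = rng.choice(sorted(eligible))
    episode = {}
    for speaker in rng.sample(sorted(eligible[language]), n_way):
        utts = rng.sample(index[language][speaker], per_speaker)
        episode[speaker] = (utts[:k_shot], utts[k_shot:])  # (support, query)
    return language, episode


def prototypical_loss(support, query):
    """Mean cross-entropy over query embeddings.

    `support` and `query` map speaker -> list of embedding vectors.
    Logits are negative squared Euclidean distances to the per-speaker
    mean of the support embeddings (assumed metric, not from the source).
    """
    protos = {
        spk: [sum(col) / len(vecs) for col in zip(*vecs)]
        for spk, vecs in support.items()
    }
    losses = []
    for spk, vecs in query.items():
        for v in vecs:
            logits = {
                p: -sum((a - b) ** 2 for a, b in zip(v, c))
                for p, c in protos.items()
            }
            # Numerically stable log-sum-exp over all prototypes.
            m = max(logits.values())
            log_z = m + math.log(sum(math.exp(x - m) for x in logits.values()))
            losses.append(log_z - logits[spk])
    return sum(losses) / len(losses)
```

In a training loop, each sampled episode would be passed through the speaker encoder to obtain the support and query embeddings before the loss is computed. A random-sampling baseline would differ only in drawing the `n_way` speakers from the pooled set of all languages instead of from one language.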