Speaker identification systems are deployed in diverse environments that often differ from the lab conditions under which they are trained and tested. In this paper, we first show that fixed thresholds (computed using the EER metric) generalize poorly for imposter identification in unseen speaker recognition, and we introduce a robust speaker-specific thresholding technique that improves performance. Second, inspired by the recent use of meta-learning techniques in speaker verification, we propose an end-to-end meta-learning framework for imposter detection that decouples imposter detection from unseen speaker identification. Thus, unlike most prior works, which rely on heuristics to detect imposters, the proposed network learns to detect imposters by leveraging the utterances of the enrolled speakers. Finally, we demonstrate the efficacy of the proposed techniques on the VoxCeleb1, VCTK, and FFSVC 2022 datasets, beating the baselines by up to 10%.
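To make the fixed-threshold baseline concrete, the sketch below shows one common way to derive a single global threshold at the equal error rate (EER) from similarity scores, and then use it to flag imposters. It is an illustrative assumption, not the authors' implementation: the function names `eer_threshold` and `is_imposter`, the score convention (higher means more similar to an enrolled speaker), and the exhaustive search over observed scores are all hypothetical choices made here for clarity.

```python
import numpy as np


def eer_threshold(genuine_scores, imposter_scores):
    """Return (threshold, eer) at the point where FAR and FRR are closest.

    Hypothetical helper: scores are assumed to be similarities, so a trial
    is accepted when its score is >= the threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    imposter = np.asarray(imposter_scores, dtype=float)

    # Every observed score is a candidate threshold.
    candidates = np.unique(np.concatenate([genuine, imposter]))

    best = None
    for t in candidates:
        far = np.mean(imposter >= t)  # imposters wrongly accepted
        frr = np.mean(genuine < t)    # enrolled speakers wrongly rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2.0)

    _, threshold, eer = best
    return float(threshold), float(eer)


def is_imposter(score, threshold):
    """Flag a test utterance as an imposter if its best score is below the threshold."""
    return score < threshold
```

Because this threshold is fitted once on a development set and then reused everywhere, it can drift from the score distribution of a new deployment environment, which is the generalization problem the abstract attributes to fixed thresholds. A speaker-specific scheme would instead choose a threshold per enrolled speaker; the abstract does not specify how, so that step is left out here.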