Speaker recognition is a biometric modality that uses segments of a speaker's speech to establish identity, determining whether the test speaker is one of the enrolled speakers. To improve the robustness of the i-vector framework under cross-channel conditions, and to explore a novel way of applying deep learning to speaker recognition, stacked auto-encoders are used to extract an abstract representation of the i-vector in place of probabilistic linear discriminant analysis (PLDA). After pre-processing and feature extraction, speaker- and channel-independent speech is used to train the universal background model (UBM). The UBM is then used to extract the i-vectors of the enrollment and test speech. Unlike the traditional i-vector framework, which uses linear discriminant analysis (LDA) to reduce dimensionality and increase the discrimination between speaker subspaces, this research uses stacked auto-encoders to reconstruct the i-vector in a lower-dimensional space, after which different classifiers can be chosen for the final classification. The experimental results show that the proposed method achieves better performance than the state-of-the-art method.
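The back end described above (i-vector, then stacked auto-encoders, then a classifier of choice) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy data stands in for real i-vectors, and the layer sizes (40 to 20 to 10), tanh encoder, linear decoder, mean-squared-error loss, greedy layer-wise training and cosine nearest-centroid classifier are hypothetical choices. All function names are likewise hypothetical.

```python
import numpy as np

def make_toy_ivectors(n_speakers=5, per_speaker=20, dim=40, noise=0.3, seed=0):
    """Synthetic stand-in for i-vectors: one random mean per speaker plus noise."""
    rng = np.random.default_rng(seed)
    means = rng.normal(size=(n_speakers, dim))
    labels = np.repeat(np.arange(n_speakers), per_speaker)
    X = means[labels] + noise * rng.normal(size=(labels.size, dim))
    return X - X.mean(axis=0), labels  # centered toy vectors and speaker labels

def train_autoencoder(X, hidden_dim, epochs=300, lr=0.01, seed=0):
    """Train one auto-encoder (tanh encoder, linear decoder) by batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 0.1, (d, hidden_dim))  # encoder weights
    b = np.zeros(hidden_dim)
    V = rng.normal(0.0, 0.1, (hidden_dim, d))  # decoder weights
    c = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W + b)        # low-dimensional code
        err = (H @ V + c) - X         # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / n         # gradient of the per-sample squared error
        g_hid = (g_out @ V.T) * (1.0 - H ** 2)  # back-propagate through tanh
        V -= lr * (H.T @ g_out)
        c -= lr * g_out.sum(axis=0)
        W -= lr * (X.T @ g_hid)
        b -= lr * g_hid.sum(axis=0)
    return (W, b), losses

def train_stacked_encoder(X, layer_dims=(20, 10)):
    """Greedy layer-wise training: each layer reconstructs the previous layer's codes."""
    layers, history, H = [], [], X
    for i, hidden_dim in enumerate(layer_dims):
        (W, b), losses = train_autoencoder(H, hidden_dim, seed=i)
        layers.append((W, b))
        history.append(losses)
        H = np.tanh(H @ W + b)
    return layers, history

def encode(X, layers):
    """Map vectors through all trained encoder layers."""
    H = X
    for W, b in layers:
        H = np.tanh(H @ W + b)
    return H

def nearest_centroid_predict(train_codes, train_labels, test_codes):
    """Cosine-similarity nearest-centroid classifier (one of many possible back ends)."""
    classes = np.unique(train_labels)
    cents = np.stack([train_codes[train_labels == k].mean(axis=0) for k in classes])
    unit = lambda A: A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    return classes[np.argmax(unit(test_codes) @ unit(cents).T, axis=1)]

X, y = make_toy_ivectors()
layers, history = train_stacked_encoder(X)
codes = encode(X, layers)
pred = nearest_centroid_predict(codes, y, codes)
print(codes.shape, float(np.mean(pred == y)))
```

In this sketch the encoder plays the role that LDA plays in the traditional framework: it reduces dimensionality before classification. Because the codes are plain vectors, the nearest-centroid step could be swapped for any other classifier without changing the encoder.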