The wide deployment of speech-based biometric systems typically demands high-performance speaker recognition algorithms. However, most prior work on speaker recognition processes speech in either the frequency domain or the time domain alone, which may produce suboptimal results because both domains carry information that is important for speaker recognition. In this paper, we analyze the speech signal in both the time and frequency domains and propose the time-frequency network (TFN) for speaker recognition, which extracts and fuses features from the two domains. Building on recent advances in deep neural networks, we propose a convolutional neural network that encodes the raw speech waveform and the frequency spectrum into domain-specific features, which are then fused and transformed into a classification feature space for speaker recognition. Experimental results on the publicly available TIMIT and LibriSpeech datasets show that our framework effectively combines the information in the two domains and outperforms state-of-the-art methods for speaker recognition.
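The two-branch data flow described in the abstract (encode the waveform and the spectrum separately, fuse the features, then classify) can be illustrated with a minimal pure-Python sketch. This is not the TFN architecture itself: the abstract does not give layer counts, kernel sizes, or the fusion operator, so the single convolution per branch, the global max pooling, the concatenation-based fusion, the random untrained weights, and all names here (`two_branch_forward`, `init_params`, `conv1d_relu_pool`) are assumptions made purely for illustration.

```python
import cmath
import math
import random


def conv1d_relu_pool(signal, kernels):
    """One 'valid' 1-D convolution per kernel, then ReLU and global max pooling."""
    feats = []
    for k in kernels:
        n_out = len(signal) - len(k) + 1
        responses = [
            max(0.0, sum(signal[i + j] * k[j] for j in range(len(k))))
            for i in range(n_out)
        ]
        feats.append(max(responses))
    return feats


def magnitude_spectrum(signal):
    """Magnitudes of the first N//2 + 1 DFT bins (naive O(N^2) DFT)."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)))
        for f in range(n // 2 + 1)
    ]


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_params(n_time=4, n_freq=4, kernel_size=5, n_speakers=3, seed=0):
    """Random (untrained) parameters; all sizes are illustrative assumptions."""
    rng = random.Random(seed)

    def rand(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    return {
        "time_kernels": [rand(kernel_size) for _ in range(n_time)],
        "freq_kernels": [rand(kernel_size) for _ in range(n_freq)],
        "W": [rand(n_time + n_freq) for _ in range(n_speakers)],
        "b": [0.0] * n_speakers,
    }


def two_branch_forward(waveform, params):
    """Hypothetical forward pass: time branch + frequency branch -> fusion -> classifier."""
    time_feat = conv1d_relu_pool(waveform, params["time_kernels"])   # raw waveform branch
    spectrum = magnitude_spectrum(waveform)
    freq_feat = conv1d_relu_pool(spectrum, params["freq_kernels"])   # spectrum branch
    fused = time_feat + freq_feat                                    # fusion by concatenation
    return softmax(dense(fused, params["W"], params["b"]))           # speaker posteriors


# Toy input: a 64-sample sine with 5 cycles, standing in for a speech frame.
waveform = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
probs = two_branch_forward(waveform, init_params())
```

Here `probs` holds one probability per hypothetical speaker. Because the weights are random, the values carry no meaning; the sketch only shows how features from the two domains can be combined before classification. Concatenation is chosen because it is the simplest fusion that keeps both feature sets intact, and a trained system would learn the kernels and classifier weights from labeled speech.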