The wide deployment of speech-based biometric systems demands high-performance speaker recognition algorithms. However, most prior work on speaker recognition processes speech in either the frequency domain or the time domain alone, which may produce suboptimal results because both domains carry information that is important for speaker recognition. In this paper, we analyze the speech signal in both the time and frequency domains and propose the time-frequency network~(TFN) for speaker recognition, which extracts and fuses features from the two domains. Building on recent advances in deep neural networks, we propose a convolutional neural network that encodes the raw speech waveform and the frequency spectrum into domain-specific features, which are then fused and transformed into a classification feature space for speaker recognition. Experimental results on the publicly available TIMIT and LibriSpeech datasets show that our framework effectively combines the information in the two domains and outperforms state-of-the-art methods for speaker recognition.
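The two-branch idea described in the abstract (encode the raw waveform and the frequency spectrum separately, fuse the two feature sets, then map them to speaker classes) can be illustrated with a minimal, dependency-free sketch. This is not the authors' architecture: the function names (`tfn_forward`, `init_params`, `encode`), the kernel counts and sizes, the random untrained weights, and fusion by simple concatenation are all assumptions made for illustration, and a plain DFT magnitude stands in for whatever spectral front end the paper actually uses.

```python
import cmath
import math
import random


def conv1d_relu(signal, kernel, stride=1):
    """Valid 1-D cross-correlation followed by ReLU (toy stand-in for a learned CNN layer)."""
    k = len(kernel)
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        acc = sum(signal[start + i] * kernel[i] for i in range(k))
        out.append(max(0.0, acc))
    return out


def magnitude_spectrum(frame):
    """DFT magnitude for the non-negative frequency bins of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)))
        for f in range(n // 2 + 1)
    ]


def encode(signal, kernels, stride):
    """Global-average-pool each kernel's activations into one value per kernel."""
    features = []
    for kernel in kernels:
        activations = conv1d_relu(signal, kernel, stride)
        features.append(sum(activations) / len(activations))
    return features


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(n_speakers, n_kernels=4, kernel_size=8, seed=0):
    """Random, untrained parameters (hypothetical shapes chosen for the demo)."""
    rng = random.Random(seed)

    def rand_vec(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    return {
        "time_kernels": [rand_vec(kernel_size) for _ in range(n_kernels)],
        "freq_kernels": [rand_vec(kernel_size) for _ in range(n_kernels)],
        "W": [rand_vec(2 * n_kernels) for _ in range(n_speakers)],
        "b": [0.0] * n_speakers,
    }


def tfn_forward(waveform, params):
    """Return per-speaker probabilities from fused time- and frequency-domain features."""
    # Time branch: convolve directly over the raw waveform samples.
    time_feat = encode(waveform, params["time_kernels"], stride=4)
    # Frequency branch: convolve over the magnitude spectrum.
    freq_feat = encode(magnitude_spectrum(waveform), params["freq_kernels"], stride=1)
    # Fusion by concatenation (an assumption; the abstract does not say how features are fused).
    fused = time_feat + freq_feat
    logits = [
        sum(w * x for w, x in zip(row, fused)) + bias
        for row, bias in zip(params["W"], params["b"])
    ]
    return softmax(logits)


if __name__ == "__main__":
    # A 64-sample synthetic tone stands in for a speech frame.
    wave = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
    print(tfn_forward(wave, init_params(n_speakers=3)))
```

Because the weights are random, the printed probabilities are meaningless as predictions; the sketch only shows how the data flows through the two branches. In a trainable version, both encoders and the classifier would be learned jointly on labeled speech from corpora such as TIMIT or LibriSpeech.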