Deep Attractor Network (DANet) is a state-of-the-art technique in speech separation that uses Bidirectional Long Short-Term Memory (BLSTM), but the model's complexity is very high. In this paper, a simplified yet powerful DANet model is proposed that uses a Bidirectional Gated Recurrent Unit (BGRU) network instead of BLSTM. A Gaussian Mixture Model (GMM), rather than k-means, is applied in DANet as the clustering algorithm to reduce complexity and increase learning speed and accuracy. The metrics used in this paper are Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), Signal-to-Artifacts Ratio (SAR), and the Perceptual Evaluation of Speech Quality (PESQ) score. Two-speaker mixture datasets from the TIMIT corpus were prepared to evaluate the proposed model, and the system achieved an SDR of 12.3 dB and a PESQ score of 2.94, both better than the original DANet model. The number of parameters and the training time were also reduced by 20.7% and 17.9%, respectively. The model was also applied to mixed Arabic speech signals, and the results were better than those for English.
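To illustrate why replacing BLSTM with BGRU lowers the parameter count, the following minimal sketch counts the weights of a stacked bidirectional recurrent network. It relies only on the standard formulations: an LSTM cell has four gate blocks (input, forget, output, and cell candidate), while a GRU cell has three (update, reset, and candidate). The layer sizes (`INPUT_SIZE`, `HIDDEN_SIZE`, `NUM_LAYERS`) and both helper functions are hypothetical and are not taken from the paper.

```python
def recurrent_params(input_size, hidden_size, gates):
    """Parameters of one recurrent layer in one direction.

    Each gate block has an input matrix, a recurrent matrix, and one bias
    vector. (Some frameworks keep two bias vectors per gate; that changes
    the totals but not the LSTM-to-GRU ratio.)
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return gates * per_gate


def bidirectional_stack_params(input_size, hidden_size, num_layers, gates):
    """Total parameters of a stacked bidirectional recurrent network."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Two directions per layer, each with its own weights.
        total += 2 * recurrent_params(layer_input, hidden_size, gates)
        # Forward and backward outputs are concatenated for the next layer.
        layer_input = 2 * hidden_size
    return total


# Hypothetical sizes chosen for illustration only (not from the paper).
INPUT_SIZE = 129    # e.g. number of frequency bins per frame
HIDDEN_SIZE = 600
NUM_LAYERS = 4

blstm = bidirectional_stack_params(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, gates=4)
bgru = bidirectional_stack_params(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, gates=3)
reduction = 1 - bgru / blstm

print(f"BLSTM: {blstm:,}  BGRU: {bgru:,}  reduction: {reduction:.1%}")
```

Under these assumptions the recurrent layers shrink by exactly 25%, whatever sizes are chosen, because every gate block has the same shape. Layers that do not depend on the cell type, such as a fully connected embedding layer, are unaffected by the swap, so the reduction for a whole model would be expected to be smaller than 25%; the reported 20.7% is compatible with that.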