In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: the Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features, and introduce a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., accounting for object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results show that models trained on EchoSet generalize better to data collected in the physical world than those trained on other datasets, validating EchoSet's practical value. On EchoSet and real-world data, TIGER reduces the number of parameters by 94.3% and the MACs by 95.3% while surpassing the performance of the state-of-the-art (SOTA) model TF-GridNet. TIGER is the first speech separation model with fewer than 1 million parameters to achieve performance comparable to the SOTA model.
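The band-split-and-compress idea described above can be illustrated with a minimal sketch. The band widths, compression dimension, and random projection below are purely illustrative assumptions, not TIGER's actual configuration; in a real model the compression would be a learned layer (e.g., a 1x1 convolution) rather than a fixed random matrix.

```python
import numpy as np

def split_bands(spec, band_widths):
    """Split an STFT spectrogram (freq_bins, frames) into sub-bands of the
    given widths: narrow bands at low frequencies (where speech harmonics
    are dense), wider bands at high frequencies."""
    assert sum(band_widths) == spec.shape[0], "widths must cover all bins"
    bands, start = [], 0
    for w in band_widths:
        bands.append(spec[start:start + w])   # shape (w, frames)
        start += w
    return bands

def compress_band(band, out_dim, rng):
    """Compress a band's frequency axis to a fixed size with a random
    linear projection (a stand-in for a learned projection layer)."""
    proj = rng.standard_normal((out_dim, band.shape[0])) / np.sqrt(band.shape[0])
    return proj @ band                        # shape (out_dim, frames)

rng = np.random.default_rng(0)
spec = rng.standard_normal((129, 50))         # 129 freq bins, 50 frames
widths = [2] * 8 + [4] * 8 + [9] * 9          # 16 + 32 + 81 = 129 bins
bands = split_bands(spec, widths)
compressed = [compress_band(b, 16, rng) for b in bands]
print(len(bands), compressed[0].shape)        # 25 bands, each (16, 50)
```

After compression, every band has the same feature size, so all bands can be stacked and processed jointly by attention modules operating along both the time and band axes.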