Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping model complexity low remains challenging in real-world applications. In this paper, we propose TDANet, a bio-inspired, efficient encoder-decoder architecture that mimics the brain's top-down attention and reduces model complexity without sacrificing performance. The top-down attention in TDANet is extracted by the global attention (GA) module and the cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input and extracts a global attention signal, which then modulates features at different scales through direct top-down connections. The LA layers take features of adjacent layers as input and extract a local attention signal, which modulates the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved separation performance competitive with previous state-of-the-art (SOTA) methods at higher efficiency. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5\% of those of Sepformer, one of the previous SOTA models, and its CPU inference time is only 10\% of Sepformer's. In addition, a large-size version of TDANet obtained SOTA results on the three datasets, with MACs still only 10\% of Sepformer's and CPU inference time only 24\% of Sepformer's.
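The global top-down modulation described above can be illustrated with a minimal sketch. It is not the TDANet implementation: the learned GA module is replaced here by a plain sum followed by a sigmoid gate, the features are toy 1-D sequences, and every function name (`avg_pool_to`, `upsample_nearest`, `global_top_down_modulation`) is hypothetical. It only shows the data flow: fuse multi-scale features at the coarsest scale, derive one global signal, and send it back to every scale.

```python
import math


def avg_pool_to(x, target_len):
    """Average-pool a 1-D sequence to target_len (len(x) must be a multiple of it)."""
    factor = len(x) // target_len
    return [sum(x[i * factor:(i + 1) * factor]) / factor for i in range(target_len)]


def upsample_nearest(x, target_len):
    """Nearest-neighbour upsample a 1-D sequence to target_len."""
    factor = target_len // len(x)
    return [v for v in x for _ in range(factor)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def global_top_down_modulation(features):
    """Gate every scale with one global signal computed at the coarsest scale.

    features: list of 1-D sequences, finest scale first, coarsest last.
    Assumption: a sum plus sigmoid stands in for the learned GA module.
    """
    coarse_len = len(features[-1])
    # Bring all scales to the coarsest resolution and fuse them.
    pooled = [avg_pool_to(f, coarse_len) for f in features]
    fused = [sum(vals) for vals in zip(*pooled)]
    # Global attention signal with values in (0, 1).
    gate = [sigmoid(v) for v in fused]
    # Direct top-down connections: resize the gate to each scale and multiply.
    return [
        [a * g for a, g in zip(f, upsample_nearest(gate, len(f)))]
        for f in features
    ]


# Toy multi-scale features whose lengths halve at each scale.
features = [
    [1.0, 1.0, 2.0, 2.0, -1.0, -1.0, 0.0, 0.0],
    [1.0, 2.0, -1.0, 0.0],
    [1.5, -0.5],
]
modulated = global_top_down_modulation(features)
```

Each scale keeps its original length, and because the gate lies in (0, 1), the modulation can only attenuate features in this simplified version. A learned module would be free to do more than that.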