Deep learning-based speech enhancement (SE) methods often face significant computational challenges when meeting low-latency requirements, because lower latency means more frames must be processed per second. This paper introduces the SlowFast framework, which aims to reduce computation cost specifically when low-latency enhancement is required. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the higher frame rate needed to meet the latency target. Specifically, the fast branch employs a state space model whose state transition process is dynamically modulated by the slow branch. Experiments on an SE task with a 2 ms algorithmic latency requirement using the Voice Bank + DEMAND dataset show that our approach reduces computation cost by 70% compared to a single-branch baseline network with an equivalent number of parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 60 µs (one sample point at a 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and an SI-SNR of 16.62.
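The core idea above — a slow branch emitting parameters at a low frame rate that modulate the state transition of a fast, per-sample state space model — can be sketched as follows. This is a minimal illustrative toy, not the authors' network: the functions `slow_branch` and `fast_branch`, the frame length, state dimension, and the energy-based parameterization are all hypothetical stand-ins for the learned components described in the paper.

```python
import numpy as np

def slow_branch(noisy, frame_len=256, state_dim=8):
    """Toy slow branch: one parameter vector per low-rate frame.

    Returns per-frame, per-state decay rates in (0, 1) that will
    modulate the fast SSM's state transition. The fixed energy-to-decay
    mapping here is illustrative only; in the paper this is a learned
    network analyzing the acoustic environment.
    """
    n_frames = len(noisy) // frame_len
    frames = noisy[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.log1p(frames.std(axis=1, keepdims=True))      # (n_frames, 1)
    proj = np.linspace(0.5, 2.0, state_dim)[None, :]          # fixed "projection"
    return 1.0 / (1.0 + np.exp(-energy * proj))               # sigmoid -> (0, 1)

def fast_branch(noisy, decays, frame_len=256):
    """Toy fast branch: per-sample diagonal SSM with modulated transition.

    Each output sample depends only on past and current input samples,
    so the algorithmic latency is a single sample.
    """
    state_dim = decays.shape[1]
    state = np.zeros(state_dim)
    out = np.zeros_like(noisy)
    readout = np.ones(state_dim) / state_dim                  # fixed readout
    for t in range(len(noisy)):
        a = decays[min(t // frame_len, len(decays) - 1)]      # latest slow params
        state = a * state + (1.0 - a) * noisy[t]              # modulated transition
        out[t] = readout @ state                              # per-sample output
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
d = slow_branch(x)       # runs once per 256-sample frame
y = fast_branch(x, d)    # runs once per sample
print(y.shape)           # same length as the input signal
```

The computational saving comes from the asymmetry: the expensive analysis runs only once per low-rate frame, while the per-sample path is a cheap diagonal recurrence, which mirrors the 70% MAC reduction the paper reports for its learned version.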