We present a novel time-domain model for resource-efficient multichannel speech enhancement, designed for low latency, a small parameter count, and low computational cost. The proposed model incorporates explicit spatial and temporal processing within deep neural network (DNN) layers. Inspired by frequency-dependent multichannel filtering, the spatial filtering stage applies multiple trainable filters to each hidden unit across the spatial dimension, producing a multichannel output. Temporal processing is then applied to a single-channel output stream from the spatial stage using a long short-term memory (LSTM) network. The output of the temporal stage is integrated back into the spatial dimension through elementwise multiplication. This explicit separation of spatial and temporal processing yields a resource-efficient network design. Our experiments show that the proposed model significantly outperforms robust baseline models while requiring far fewer parameters and computations and achieving an ultra-low algorithmic latency of just 2 milliseconds.
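As a rough illustration of the data flow described in the abstract, the following NumPy sketch chains the three steps: per-hidden-unit spatial filtering, an LSTM over a single-channel stream, and elementwise multiplication back into the spatial dimension. It is a minimal sketch under stated assumptions, not the authors' implementation: the tensor layout `(T, C, D)`, the function names, the weight shapes, and the choice of output channel 0 as the single-channel stream are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_filter(h, w):
    """Apply K trainable filters per hidden unit across the spatial dimension.

    h: (T, C, D) hidden features, with T frames, C spatial channels, D hidden units.
    w: (D, K, C) filter weights (hypothetical layout).
    Returns a multichannel output of shape (T, K, D).
    """
    return np.einsum("tcd,dkc->tkd", h, w)


def lstm(x, w_x, w_h, b):
    """Unidirectional (causal) LSTM over a single-channel stream.

    x: (T, D); w_x and w_h: (D, 4 * D); b: (4 * D,). Returns (T, D).
    """
    n_frames, n_hidden = x.shape
    h_state = np.zeros(n_hidden)
    c_state = np.zeros(n_hidden)
    outputs = np.empty((n_frames, n_hidden))
    for t in range(n_frames):
        z = x[t] @ w_x + h_state @ w_h + b
        i, f, g, o = np.split(z, 4)  # input, forget, candidate, output gates
        c_state = sigmoid(f) * c_state + sigmoid(i) * np.tanh(g)
        h_state = sigmoid(o) * np.tanh(c_state)
        outputs[t] = h_state
    return outputs


def spatio_temporal_block(h, w_spatial, lstm_params):
    """Spatial filtering, then temporal processing, then elementwise fusion."""
    y = spatial_filter(h, w_spatial)       # (T, K, D) multichannel output
    stream = y[:, 0, :]                    # assumption: channel 0 is the single-channel stream
    gate = lstm(stream, *lstm_params)      # (T, D) temporal output
    return y * gate[:, None, :]            # broadcast the temporal output over all channels


# Small demo with random weights (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, C, K, D = 10, 3, 4, 8
features = rng.standard_normal((T, C, D))
w_spatial = rng.standard_normal((D, K, C))
lstm_params = (
    rng.standard_normal((D, 4 * D)),
    rng.standard_normal((D, 4 * D)),
    np.zeros(4 * D),
)
out = spatio_temporal_block(features, w_spatial, lstm_params)
print(out.shape)  # (10, 4, 8)
```

Because the LSTM here is unidirectional and the spatial filter acts frame by frame, the output at frame `t` depends only on frames up to `t`, which is the kind of causal structure a low-latency design relies on. The LSTM runs on one stream rather than on every channel, which is one plausible reading of where the parameter and compute savings come from.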