Single-channel speech enhancement algorithms are often deployed on resource-constrained embedded devices, where low latency and low computational complexity are critical design goals. In recent years, researchers have proposed a wide variety of novel solutions to this problem. In particular, a recent deep learning model named ULCNet is among the state-of-the-art approaches in this domain. This paper proposes an adaptation of ULCNet that replaces its GRU layers with FastGRNNs to reduce both computational latency and complexity. Furthermore, this paper presents empirical evidence that FastGRNNs suffer performance degradation on long audio signals during inference due to internal state drift, and proposes a novel mitigation based on a trainable complementary filter. The resulting model, Fast-ULCNet, performs on par with the original state-of-the-art ULCNet architecture on a speech enhancement task, while reducing model size by more than half and decreasing latency by 34% on average.
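For readers unfamiliar with the recurrent cell mentioned above, the following is a minimal NumPy sketch of the FastGRNN update (Kusupati et al., 2018), which the abstract refers to. The key idea is that the single gate and the candidate state share the same weight matrices W and U, with two trainable scalars (zeta, nu) scaling the update; this sharing is what makes the cell cheaper than a GRU. Parameter names and initialization here are illustrative assumptions, not taken from the Fast-ULCNet implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FastGRNNCell:
    """Minimal FastGRNN cell sketch: one gate z_t whose affine map
    (W, U) is shared with the candidate state h_tilde, plus two
    scalars zeta and nu (trainable in the real model)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_dim)
        self.W = rng.uniform(-s, s, (hidden_dim, input_dim))
        self.U = rng.uniform(-s, s, (hidden_dim, hidden_dim))
        self.b_z = np.zeros(hidden_dim)   # gate bias
        self.b_h = np.zeros(hidden_dim)   # candidate bias
        self.zeta = 1.0                   # illustrative scalar values
        self.nu = 1e-4

    def step(self, x, h):
        pre = self.W @ x + self.U @ h               # shared affine map
        z = sigmoid(pre + self.b_z)                 # update gate
        h_tilde = np.tanh(pre + self.b_h)           # candidate state
        # Gated blend: z keeps the old state, (zeta*(1-z)+nu) admits the new.
        return (self.zeta * (1.0 - z) + self.nu) * h_tilde + z * h
```

Note that because h is carried across every step, small biases in the update can accumulate over very long sequences; this is the internal state drift the abstract reports, which the proposed trainable complementary filter is designed to counteract.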