The real-time processing of time series signals is critical for many real-life applications. It is especially important in the audio domain, because human perception of sound is sensitive to any disturbance in the perceived signal, particularly lag between the auditory and visual modalities. The rise of deep learning (DL) models has complicated the landscape of signal processing: although they often deliver superior quality compared to standard digital signal processing (DSP) methods, this advantage is diminished by higher latency. In this work we propose a novel method for minimizing inference-time latency and memory consumption, called Short-Term Memory Convolution (STMC), together with its transposed counterpart. The main advantage of STMC is its low latency, comparable to that of long short-term memory (LSTM) networks. Furthermore, training STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate an application of this solution to a U-Net model for a speech separation task and to a GhostNet model for an acoustic scene classification (ASC) task. For speech separation, we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting output quality. For the ASC task, inference was up to 4 times faster while preserving the original accuracy.