In real-time speech enhancement (SE), which includes noise suppression, dereverberation and acoustic echo cancellation, the time variance of audio signals is a severe challenge. Causality and memory constraints mean that the system can use only historical information to capture time-variant characteristics. We propose to adapt the receptive field of a deep-neural-network-based SE model to the input signal. Specifically, in an encoder-decoder framework, a dynamic attention span mechanism is introduced into all attention modules to control how much historical content is used when processing the current frame. Experimental results verify that this dynamic mechanism better tracks time-variant factors and captures speech-related characteristics, benefiting both interference removal and speech quality preservation.
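As an illustration of the idea, the following minimal sketch shows one way a causal attention module could choose its look-back span from the current frame. The abstract does not specify the formulation, so everything here is an assumption made for illustration: the function name `dynamic_span_attention`, the linear-plus-sigmoid span predictor (`w_span`, `b_span`), the soft ramp mask borrowed from adaptive-span attention, and the parameters `max_span` and `ramp`. Learned query/key/value projections are omitted for brevity.

```python
import numpy as np

def dynamic_span_attention(x, w_span, b_span, max_span=8, ramp=2.0):
    """Causal self-attention with an input-dependent look-back span.

    x       : (T, D) array of frame features (hypothetical encoder output).
    w_span  : (D,) weights of an assumed linear span predictor.
    b_span  : scalar bias of the span predictor.
    max_span: hard upper bound on the number of frames kept in memory.
    ramp    : width of the soft ramp that fades out frames beyond the span.
    Returns the attended features (T, D) and the predicted spans (T,).
    """
    T, D = x.shape
    out = np.zeros((T, D), dtype=float)
    spans = np.zeros(T)
    for t in range(T):
        # Predict a span in (0, max_span) from the current frame only.
        z = max_span / (1.0 + np.exp(-(x[t] @ w_span + b_span)))
        spans[t] = z

        # Causal context: the current frame and at most max_span - 1 past frames.
        lo = max(0, t - max_span + 1)
        ctx = x[lo:t + 1]
        dist = t - np.arange(lo, t + 1)  # distance of each frame into the past

        # Soft mask: 1 inside the span, linear decay to 0 over `ramp` frames.
        mask = np.clip((ramp + z - dist) / ramp, 0.0, 1.0)

        # Scaled dot-product scores with the current frame as the query.
        scores = ctx @ x[t] / np.sqrt(D)
        weights = np.exp(scores - scores.max()) * mask
        weights /= weights.sum()  # mask is 1 at dist 0, so the sum is positive

        out[t] = weights @ ctx
    return out, spans


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((12, 4))
    w = rng.standard_normal(4)
    enhanced, spans = dynamic_span_attention(frames, w, 0.0)
    print(enhanced.shape, spans.round(2))
```

In this sketch the span is recomputed for every frame, so a frame whose features push the predictor toward a small span attends almost only to itself, while other frames draw on more history, up to the fixed memory bound `max_span`. A trainable model would learn the predictor jointly with the rest of the network.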