Speech enhancement is a demanding task in automated speech processing pipelines: it aims to separate clean speech from noisy channels. Transformer-based models have recently outperformed RNN and CNN models in speech enhancement, but they are much more computationally expensive and require far more high-quality training data, which is hard to come by. In this paper, we present Spectrum Attention Fusion, an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity. We carefully construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to fuse spectral features more efficiently. On the Voice Bank + DEMAND dataset, our proposed model achieves results comparable to or better than state-of-the-art (SOTA) models with significantly fewer parameters (0.58M).
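To give a rough sense of why swapping self-attention layers for convolutions can shrink a model, the sketch below compares the learnable parameter count of one standard self-attention layer with that of one depthwise-separable 1-D convolution. It is only an illustration under stated assumptions: the depthwise-separable convolution is a generic stand-in and not the Spectrum Attention Fusion module itself, and the width `d_model = 64` and kernel size `3` are hypothetical values not taken from the paper.

```python
def self_attention_params(d_model: int) -> int:
    """Parameters of one standard self-attention layer.

    Counts the query, key, value and output projections, each a
    d_model x d_model weight matrix plus a bias vector of length d_model.
    """
    return 4 * (d_model * d_model + d_model)


def depthwise_separable_conv_params(channels: int, kernel_size: int) -> int:
    """Parameters of a depthwise-separable 1-D convolution (assumed stand-in).

    Depthwise step: one kernel of length kernel_size per channel, plus bias.
    Pointwise step: a 1x1 convolution mixing channels, plus bias.
    """
    depthwise = channels * kernel_size + channels
    pointwise = channels * channels + channels
    return depthwise + pointwise


if __name__ == "__main__":
    d_model = 64      # hypothetical layer width, not from the paper
    kernel_size = 3   # hypothetical kernel size, not from the paper
    attn = self_attention_params(d_model)
    conv = depthwise_separable_conv_params(d_model, kernel_size)
    print(f"self-attention layer: {attn} parameters")
    print(f"separable conv layer: {conv} parameters")
    print(f"ratio: {attn / conv:.2f}x")
```

Under these assumptions, the convolutional layer needs roughly a quarter of the parameters of the attention layer. The actual saving in the proposed model depends on its real layer design, which this sketch does not reproduce.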