Speech enhancement is the task of removing unwanted background sounds from target speech to improve its quality and intelligibility. This paper presents a novel approach to single-channel speech enhancement based on colored spectrograms. We adapt a deep neural network (DNN) architecture from the pix2pix generative adversarial network (GAN) and train it on colored spectrograms of speech to denoise them. After denoising, a shallow regression neural network translates the spectrogram colors into short-time Fourier transform (STFT) magnitudes. These estimated STFT magnitudes are then combined with the noisy phases to obtain the enhanced speech. The results show an improvement of almost 0.84 points in the perceptual evaluation of speech quality (PESQ) and 1% in the short-time objective intelligibility (STOI) over the unprocessed noisy data. This gain in quality and intelligibility is almost equal to that achieved by the baseline methods used for comparison, but it comes at a much lower computational cost. Specifically, the proposed solution achieves a PESQ score comparable to that of a similar baseline model trained on grayscale spectrograms, which produced the highest PESQ score, at an almost 10 times lower computational cost. Compared with another baseline system based on a convolutional neural network GAN (CNN-GAN), which produces the most intelligible speech, it incurs only a 1% deficit in STOI at a 28 times lower computational cost.
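The recombination of estimated magnitudes with noisy phases can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `recombine`, the list-of-lists layout (frames by frequency bins), and the toy values are assumptions made purely for illustration, and the estimated magnitudes are hard-coded here rather than produced by a trained network.

```python
import cmath
import math


def recombine(est_magnitudes, noisy_stft):
    """Pair each estimated magnitude with the phase of the matching noisy STFT bin.

    Both arguments are lists of frames; each frame is a list of frequency bins.
    (Hypothetical helper, not taken from the paper.)
    """
    enhanced = []
    for mag_frame, noisy_frame in zip(est_magnitudes, noisy_stft):
        # cmath.phase extracts the noisy phase; cmath.rect rebuilds a complex
        # value from (magnitude, phase).
        frame = [cmath.rect(mag, cmath.phase(z))
                 for mag, z in zip(mag_frame, noisy_frame)]
        enhanced.append(frame)
    return enhanced


# Toy data (assumed values): two frames, three frequency bins each.
noisy_stft = [[1 + 1j, -2j, 3 + 0j],
              [0.5 - 0.5j, -1 + 0j, 2j]]
est_magnitudes = [[0.5, 1.0, 2.0],
                  [0.25, 0.5, 1.0]]

enhanced_stft = recombine(est_magnitudes, noisy_stft)
```

Each enhanced bin keeps the phase of the noisy input and takes its magnitude from the estimate. An inverse STFT over `enhanced_stft` would then be needed to return to a time-domain waveform; that step is omitted from this sketch.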