Gaps, dropouts and short clips of corrupted audio are a common problem, and they are particularly annoying when they occur in speech. This paper uses machine learning to regenerate gaps of up to 320 ms in a speech audio signal. Audio regeneration is recast as image regeneration: the audio is transformed into a Mel-spectrogram, and image in-painting is used to regenerate the gaps. The full Mel-spectrogram is then converted back to audio using the Parallel-WaveGAN vocoder and integrated into the audio stream. On a sample of 1300 spoken audio clips of between 1 and 10 seconds, taken from the publicly available LJSpeech dataset, our results show that GANs running on a GPU-equipped system can regenerate audio gaps in close to real time. As expected, the smaller the gap in the audio, the better the quality of the filled gap. For a gap of 240 ms, the average mean opinion score (MOS) of the best-performing models was 3.737 on a scale of 1 (worst) to 5 (best), which is sufficient for a listener to perceive the result as close to uninterrupted human speech.
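To make the audio-to-image step concrete, the following minimal sketch shows how a gap measured in milliseconds maps onto columns of a Mel-spectrogram, and how those columns can be blanked to form the region an in-painting model would fill. It is an illustration under stated assumptions, not the paper's implementation: the hop length of 256 samples and the 80 Mel bands are assumed values that the abstract does not give, the helper names `gap_to_frames` and `mask_gap` are hypothetical, and a random array stands in for a real spectrogram. Only the 22050 Hz sampling rate comes from the LJSpeech dataset itself.

```python
import numpy as np

SAMPLE_RATE = 22050  # LJSpeech sampling rate in Hz
HOP_LENGTH = 256     # ASSUMED hop size in samples (not stated in the abstract)
N_MELS = 80          # ASSUMED number of Mel bands (not stated in the abstract)


def gap_to_frames(gap_ms, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Number of spectrogram frames needed to cover a gap of gap_ms milliseconds."""
    gap_samples = gap_ms / 1000.0 * sample_rate
    return int(np.ceil(gap_samples / hop_length))


def mask_gap(mel, start_frame, n_frames, fill_value=0.0):
    """Return a copy of mel with the gap columns blanked, plus a boolean frame mask."""
    mask = np.zeros(mel.shape[1], dtype=bool)
    mask[start_frame:start_frame + n_frames] = True
    masked = mel.copy()
    masked[:, mask] = fill_value
    return masked, mask


# Stand-in for a real Mel-spectrogram: N_MELS bands x 400 frames of random values.
rng = np.random.default_rng(0)
mel = rng.random((N_MELS, 400))

n_gap = gap_to_frames(320)  # frames covered by the largest gap studied
masked, mask = mask_gap(mel, start_frame=100, n_frames=n_gap)
print(n_gap, int(mask.sum()))
```

Under these assumed settings a 320 ms gap spans 28 frames and a 240 ms gap spans 21, so the region to be in-painted is a narrow vertical strip of the spectrogram image. The boolean mask is returned alongside the blanked spectrogram because in-painting models typically need to be told which pixels are missing.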