In this paper, we propose a dual-stage architecture for bandwidth extension (BWE) that increases the effective sampling rate of speech signals from 8 kHz to 48 kHz. Unlike existing end-to-end deep learning models, our method explicitly models BWE using an excitation stage and a linear time-varying (LTV) filter stage. The excitation stage broadens the spectrum of the input, while the filtering stage shapes it based on the outputs of an acoustic feature predictor. With this structure, an acoustic feature loss term can implicitly encourage the excitation subnetwork to produce white spectra in the upper frequency band to be synthesized. Experimental results demonstrate that the added inductive bias provided by our approach can improve BWE results when the generator from either SEANet or HiFi-GAN is used as the exciter, and that our means of adapting processing with acoustic feature predictions is more effective than that used in HiFi-GAN-2. Secondary contributions include extensions of the SEANet model to accommodate local conditioning information, as well as the application of HiFi-GAN-2 to the BWE problem.
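As a rough illustration of the excitation-plus-LTV-filter decomposition described above, the following NumPy sketch may help. It is not the proposed system: zero-stuffing stands in for the learned exciter (SEANet or HiFi-GAN generator), and the per-frame gains `frame_gains` are assumed to come from some acoustic feature predictor that is not modeled here. The function names `toy_excitation` and `ltv_filter`, as well as the frame size and hop, are hypothetical choices made only for this example.

```python
import numpy as np


def toy_excitation(x_nb, factor=6):
    """Stand-in exciter (assumption): zero-stuff the narrowband signal.

    Inserting factor-1 zeros between samples raises the rate from 8 kHz to
    48 kHz (factor=6) and leaves spectral images in the upper band, which is
    a crude way of "broadening" the spectrum.
    """
    y = np.zeros(len(x_nb) * factor)
    y[::factor] = x_nb
    return y


def ltv_filter(excitation, frame_gains, n_fft=512, hop=256):
    """Frame-wise spectral shaping with overlap-add.

    frame_gains has shape (n_frames, n_fft // 2 + 1): one magnitude response
    per frame, so the filter changes over time. Multiplying in the FFT domain
    is a circular convolution, which is acceptable for a sketch only.
    """
    # Periodic Hann window: overlapping copies sum to 1 at 50% overlap.
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = frame_gains.shape[0]
    padded = np.zeros((n_frames - 1) * hop + n_fft)
    n = min(len(excitation), len(padded))
    padded[:n] = excitation[:n]
    out = np.zeros_like(padded)
    for t in range(n_frames):
        start = t * hop
        frame = padded[start:start + n_fft] * window
        shaped = np.fft.rfft(frame) * frame_gains[t]
        out[start:start + n_fft] += np.fft.irfft(shaped, n=n_fft)
    return out[:n]
```

With all gains set to one, the filter stage passes the excitation through unchanged (away from the signal edges), so any spectral shaping in this sketch comes entirely from `frame_gains`.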