Most existing speech bandwidth extension (BWE) methods assume fixed source and target sampling rates, which limits their flexibility in practical applications. In this paper, we propose MS-BWE, a multi-stage speech BWE model that handles a set of source and target sampling rate pairs and thereby extends frequency bandwidth flexibly. MS-BWE comprises a cascade of BWE blocks, each with a dual-stream architecture that extends amplitude and phase, progressively filling in the speech frequency bands stage by stage. A teacher-forcing strategy is employed to mitigate the discrepancy between training and inference. Experimental results demonstrate that MS-BWE is comparable to state-of-the-art speech BWE methods in speech quality. In terms of generation efficiency, one-stage generation with MS-BWE runs at over one thousand times real time on GPU and about sixty times real time on CPU.
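To illustrate how a cascade of BWE blocks could serve several source and target sampling rate pairs, the following is a minimal sketch rather than the MS-BWE implementation. It assumes a hypothetical ladder of sampling rates (the abstract does not list the actual rates) and the hypothetical helper name `plan_stages`. It only shows which consecutive blocks would be chained for a given pair; the dual-stream amplitude and phase extension inside each block is not modeled.

```python
from typing import List, Tuple

# Hypothetical ladder of sampling rates in Hz. These values are an
# assumption for illustration; the abstract does not specify them.
RATE_LADDER: List[int] = [8000, 16000, 24000, 48000]


def plan_stages(source_sr: int, target_sr: int,
                ladder: List[int] = RATE_LADDER) -> List[Tuple[int, int]]:
    """Return the (input_rate, output_rate) pair of every block to run.

    Each adjacent pair of rates in the ladder stands for one BWE block,
    so extending from source_sr to target_sr means running every block
    that lies between the two rates, in order.
    """
    if source_sr not in ladder or target_sr not in ladder:
        raise ValueError("both rates must be on the ladder")
    start, end = ladder.index(source_sr), ladder.index(target_sr)
    if start >= end:
        raise ValueError("target rate must be higher than source rate")
    return [(ladder[k], ladder[k + 1]) for k in range(start, end)]


if __name__ == "__main__":
    # Extending across the whole ladder chains all three blocks.
    print(plan_stages(8000, 48000))
    # A pair of adjacent rates needs only one block (one-stage generation).
    print(plan_stages(16000, 24000))
```

Under these assumptions, a pair of adjacent rates corresponds to the one-stage case mentioned in the efficiency figures, while wider extensions reuse the same blocks in sequence.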