Properly setting up recording conditions, including microphone type and placement, room acoustics, and ambient noise, is essential to obtaining the desired acoustic characteristics of speech. In this paper, we propose DiffRENT (Diff-R-EN-T), a Diffusion model for Recording ENvironment Transfer that transforms input speech to have the recording conditions of a reference speech signal while preserving the speech content. Our model comprises a content enhancer, a recording environment encoder, and a diffusion decoder that generates the target mel-spectrogram conditioned on the outputs of both the enhancer and the encoder. We evaluate DiffRENT in speech enhancement and acoustic matching scenarios. The results show that DiffRENT generalizes well to unseen environments and new speakers. The proposed model also achieves superior performance in both objective and subjective evaluations. Sound examples of our proposed model are available online.