This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By taking separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, DAIEN-TTS first uses a pretrained speech-environment separation (SES) module to disentangle environmental speech into two mel-spectrograms, one of clean speech and one of environment audio. Two random span masks of varying lengths are then applied to the two mel-spectrograms; the masked mel-spectrograms, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, which enables the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environment-aware personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
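The three mechanisms named in the abstract (random span masking, dual classifier-free guidance, and SNR adaptation) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption rather than the authors' implementation: the function names, the mask-length range, the additive two-weight guidance rule, and the power-ratio gain are generic textbook forms chosen for illustration only.

```python
import math
import random


def random_span_mask(n_frames, min_frac=0.3, max_frac=0.7, rng=random):
    """Return a boolean list with one contiguous masked span (True = masked).

    The fraction range is a hypothetical choice, not taken from the paper.
    """
    span = max(1, int(n_frames * rng.uniform(min_frac, max_frac)))
    span = min(span, n_frames)
    start = rng.randint(0, n_frames - span)  # randint is inclusive on both ends
    return [start <= t < start + span for t in range(n_frames)]


def dual_cfg(uncond, cond_speech, cond_env, w_speech, w_env):
    """Combine three model outputs using two independent guidance weights.

    Assumed additive form: u + w_s * (s - u) + w_e * (e - u).
    The paper's actual DCFG rule may differ.
    """
    return [
        u + w_speech * (s - u) + w_env * (e - u)
        for u, s, e in zip(uncond, cond_speech, cond_env)
    ]


def env_gain_for_snr(speech_power, env_power, target_snr_db):
    """Amplitude gain for the environment audio that yields the target SNR.

    Uses the standard definition SNR_dB = 10 * log10(P_speech / P_env).
    """
    return math.sqrt(speech_power / (env_power * 10 ** (target_snr_db / 10)))


# Usage: one mask per mel-spectrogram, drawn independently (an assumption).
rng = random.Random(0)
speech_mask = random_span_mask(100, rng=rng)
env_mask = random_span_mask(100, rng=rng)
guided = dual_cfg([0.0, 1.0], [1.0, 1.0], [0.0, 3.0], w_speech=2.0, w_env=0.5)
gain = env_gain_for_snr(speech_power=1.0, env_power=1.0, target_snr_db=20.0)
```

Keeping the two guidance weights separate is what lets the speaker and environment conditions be strengthened or weakened independently at inference time, which matches the independent control the abstract describes.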