DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
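The abstract names two inference-time controls, dual classifier-free guidance and SNR adaptation, without giving their formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes the model can be queried three ways (both prompts dropped, speaker prompt only, both prompts kept) and combines the outputs with two independent weights. It also assumes SNR adaptation can be illustrated in the waveform domain as a gain on the environment signal. The names `dual_cfg`, `snr_db`, `env_gain_for_snr`, `w_spk`, and `w_env` are hypothetical.

```python
import math


def dual_cfg(v_null, v_spk, v_full, w_spk, w_env):
    """Combine three model outputs with separate guidance weights.

    Assumed conditioning (not confirmed by the abstract):
      v_null: both speaker and environment prompts dropped
      v_spk:  speaker prompt kept, environment prompt dropped
      v_full: both prompts kept
    With w_spk == w_env == 1 the result equals v_full.
    """
    return [
        n + w_spk * (s - n) + w_env * (f - s)
        for n, s, f in zip(v_null, v_spk, v_full)
    ]


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def snr_db(speech, env):
    """Signal-to-noise ratio in dB, treating the environment as noise."""
    return 10.0 * math.log10(power(speech) / power(env))


def env_gain_for_snr(speech, env, target_snr_db):
    """Amplitude gain for `env` so that speech-to-env SNR hits the target."""
    target_ratio = 10.0 ** (target_snr_db / 10.0)
    return math.sqrt(power(speech) / (power(env) * target_ratio))
```

In this reading, raising `w_spk` pushes the output toward the speaker prompt and raising `w_env` pushes it toward the environment prompt, while the gain from `env_gain_for_snr` would be chosen so the synthesized mix matches an SNR estimated from the environment prompt.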


