Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio clip, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore do not support continuous and reference-relative style control. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS. Our key insight is that effective style control requires first reducing the model's implicit dependence on reference style before introducing explicit control mechanisms. To this end, we introduce Decoupled Classifier-Free Guidance (DCFG), which independently controls text and reference guidance, reducing reliance on reference style while preserving text fidelity. On top of this, we apply style-specific LoRAs together with Orthogonal LoRA Fusion to enable continuous and disentangled multi-attribute control, and introduce a Timbre Consistency Optimization module to mitigate timbre drift caused by weakened reference guidance. Experiments show that ReStyle-TTS enables user-friendly, continuous, and relative control over pitch, energy, and multiple emotions while maintaining intelligibility and speaker timbre, and performs robustly in challenging mismatched reference-target style scenarios.
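The abstract does not give the DCFG formula, so the following is only a minimal sketch of one common way to decouple guidance over two conditions, not the authors' definition. It assumes the model can be queried three ways (no conditioning, text only, text plus reference) and that each guidance direction gets its own scale. The function name `decoupled_cfg` and the weights `w_text` and `w_ref` are hypothetical names chosen for illustration.

```python
import numpy as np


def decoupled_cfg(v_uncond, v_text, v_full, w_text, w_ref):
    """Blend three model predictions using separate text and reference scales.

    v_uncond: prediction with neither text nor reference audio
    v_text:   prediction conditioned on text only
    v_full:   prediction conditioned on text and reference audio
    (This three-way split is an assumption, not taken from the paper.)
    """
    text_direction = v_text - v_uncond  # what adding the text changes
    ref_direction = v_full - v_text     # what adding the reference changes
    return v_uncond + w_text * text_direction + w_ref * ref_direction


# Toy stand-ins for model outputs (hypothetical values, not real predictions).
rng = np.random.default_rng(0)
v_uncond, v_text, v_full = rng.normal(size=(3, 4))

# Strong text guidance combined with weakened reference guidance.
guided = decoupled_cfg(v_uncond, v_text, v_full, w_text=2.0, w_ref=0.5)
```

In this sketch, setting `w_text` equal to `w_ref` recovers ordinary single-scale classifier-free guidance on the joint condition, while lowering `w_ref` alone reduces the pull of the reference without touching the text term.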