Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods rely mainly on direct combinations of audio-related labels or their diverse rephrasings, making it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users, such as content creators, who wish to steer generation with descriptive instructions. To address these limitations, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability. The dataset and demos are publicly available on our project page.