Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks, such as transforming input speech or processing audio captured under adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, handling both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way to leverage textual input in speech enhancement and transformation tasks. Experimental results demonstrate SpeechX's efficacy across tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving performance comparable or superior to specialized models on each task. See https://aka.ms/speechx for demo samples.
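The task-dependent prompting mentioned above can be illustrated with a minimal sketch: a single task token placed at the head of the conditioning sequence selects the behavior of one shared codec language model. The token names and sequence layout below are illustrative assumptions for exposition, not the paper's actual vocabulary or format.

```python
# Hypothetical sketch of task-dependent prompting for a neural codec
# language model. Task-token names and the sequence layout are
# assumptions made for illustration, not SpeechX's actual tokens.

TASK_TOKENS = {
    "zero_shot_tts": "<tts>",
    "noise_suppression": "<ns>",
    "target_speaker_extraction": "<tse>",
    "speech_removal": "<sr>",
    "speech_editing": "<edit>",
}

def build_prompt(task, text_tokens, acoustic_tokens):
    """Assemble the conditioning sequence for the codec language model.

    text_tokens may be empty for tasks that do not use text (e.g. plain
    noise suppression); acoustic_tokens stand in for codec codes of the
    input or enrollment audio.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    # Task token first, then text conditioning, then acoustic conditioning;
    # the model would autoregressively generate output codec tokens from here.
    return [TASK_TOKENS[task], *text_tokens, *acoustic_tokens]

prompt = build_prompt("zero_shot_tts", ["hel", "lo"], ["a1", "a2"])
print(prompt)  # → ['<tts>', 'hel', 'lo', 'a1', 'a2']
```

The design point this sketch captures is extensibility: adding a new task requires only a new task token and appropriate training data, not a new model.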