Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations such as high-quality zero-shot text-to-speech (TTS). However, existing models still face limitations in handling diverse audio-text speech generation tasks that involve transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way to leverage textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.