Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
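The Destylizer/Stylizer pipeline described above can be sketched minimally in code. Everything here is an illustrative assumption: the nearest-neighbor quantization stands in for StyleStream's constrained information bottleneck, and the additive conditioning stands in for the reference-conditioned diffusion transformer; neither reflects the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def destylize(features, codebook):
    # Illustrative bottleneck: snap each frame to its nearest codebook
    # entry, discarding fine-grained (style-like) detail while keeping
    # coarse (content-like) structure. StyleStream's real Destylizer is
    # trained with text supervision; this is only a stand-in.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return codebook[np.argmin(dists, axis=1)]

def stylize(content, style_reference):
    # Illustrative conditioning: add a pooled style embedding from the
    # reference, standing in for the DiT-based Stylizer.
    style_vec = style_reference.mean(axis=0)
    return content + style_vec

# Toy dimensions: T frames, D-dim features, K codebook entries.
T, D, K = 50, 8, 16
codebook = rng.normal(size=(K, D))
source_features = rng.normal(size=(T, D))
reference_features = rng.normal(size=(30, D))

content = destylize(source_features, codebook)   # style removed
output = stylize(content, reference_features)    # target style reinjected
```

Because both stages operate frame-by-frame with no autoregressive dependency, a streaming deployment could run them on fixed-size chunks of incoming audio, which is the property that enables StyleStream's low end-to-end latency.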