Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, which causes content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This design mitigates the trade-off and yields an efficient, compact latent space without relying on complex regularizers. SARA achieves better reconstruction quality than strong baselines. Furthermore, in downstream zero-shot TTS, it produces highly natural and expressive speech and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.
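The dual-stream idea can be illustrated with a minimal NumPy sketch. This is not the SARA implementation: the abstract does not specify the fusion operator, layer types, or dimensions, so the concatenation-based fusion, the random linear layers standing in for the frozen SSL model and the residual acoustic encoder, and all names and sizes below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained layer (hypothetical)."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

# Hypothetical sizes: frames, SSL feature dim, acoustic feature dim, latent dim.
T, D_SSL, D_MEL, D_LAT = 50, 32, 20, 8

W_ssl = linear(D_MEL, D_SSL)         # stand-in for the frozen SSL model (never updated)
W_res = linear(D_MEL, D_SSL)         # stand-in for the residual acoustic encoder
W_mu = linear(2 * D_SSL, D_LAT)      # posterior mean head
W_logvar = linear(2 * D_SSL, D_LAT)  # posterior log-variance head
W_dec = linear(D_LAT, D_MEL)         # decoder back to acoustic features

def encode(mel):
    semantic = np.tanh(mel @ W_ssl)  # semantic anchor stream
    acoustic = np.tanh(mel @ W_res)  # residual acoustic stream
    # Assumed fusion: channel-wise concatenation of the two streams.
    fused = np.concatenate([semantic, acoustic], axis=-1)
    return fused @ W_mu, fused @ W_logvar

def reparameterize(mu, logvar):
    """Standard VAE reparameterization: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)), summed over latent dims, averaged over frames."""
    return 0.5 * np.mean(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1))

mel = rng.standard_normal((T, D_MEL))  # fake acoustic features for one utterance
mu, logvar = encode(mel)
z = reparameterize(mu, logvar)
recon = z @ W_dec
recon_loss = np.mean((recon - mel) ** 2)
kl = kl_to_standard_normal(mu, logvar)
print(z.shape, recon.shape)
```

In a trainable version of this sketch, only the residual encoder, the posterior heads, and the decoder would receive gradients, while the SSL stand-in stays fixed, which is what "frozen" refers to here.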