IndexTTS 2.5 Technical Report

In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module. Together they enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building on this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving a notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: we propose three explicit cross-lingual modeling strategies (boundary-aware alignment, token-level concatenation, and instruction-guided generation), establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and naturalness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28× improvement in real-time factor (RTF) while maintaining word error rate (WER) and speaker similarity comparable to IndexTTS 2.
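The arithmetic behind two of these claims can be sketched in a few lines of Python. This is an illustration only, not code from the report: the helper names, the 10-second utterance, and the timing values are all hypothetical. The sketch assumes that RTF is defined in the usual way (synthesis time divided by the duration of the generated audio) and that the "2.28×" figure is the ratio of the baseline RTF to the new RTF.

```python
import math


def semantic_token_count(duration_s: float, frame_rate_hz: float) -> int:
    """Number of semantic tokens needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def real_time_factor(synthesis_time_s: float, audio_duration_s: float) -> float:
    """RTF = synthesis time / duration of audio produced (lower is faster)."""
    return synthesis_time_s / audio_duration_s


# Hypothetical 10-second utterance, chosen only for illustration.
duration_s = 10.0

# Halving the codec frame rate halves the token sequence the T2S module
# must generate for the same audio.
tokens_50hz = semantic_token_count(duration_s, 50.0)
tokens_25hz = semantic_token_count(duration_s, 25.0)
print(f"50 Hz: {tokens_50hz} tokens, 25 Hz: {tokens_25hz} tokens")

# Hypothetical wall-clock timings (NOT measurements from the report),
# picked so that the RTF ratio matches the stated 2.28x figure.
baseline_rtf = real_time_factor(2.28, duration_s)
improved_rtf = real_time_factor(1.00, duration_s)
speedup = baseline_rtf / improved_rtf
print(f"RTF {baseline_rtf:.3f} -> {improved_rtf:.3f} ({speedup:.2f}x faster)")
```

If the T2S module decodes one semantic token per step, a shorter sequence would translate directly into fewer decoding steps. How much of the overall speedup comes from the codec change and how much from the S2M backbone replacement is not something this sketch can say.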
