This technical report presents MOSS-TTS, a speech generation foundation model that follows a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. The model builds on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 frames per second using variable-bitrate residual vector quantization (RVQ) and unified semantic-acoustic representations. We release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which adds a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
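To make the tokenizer figures quoted above concrete, the following is a minimal arithmetic sketch of what a 24 kHz input at 12.5 frames per second implies for sequence length. Only the sample rate and frame rate come from the report; the function names and the `num_rvq_layers` parameter are hypothetical, since the number of RVQ layers is not stated here, and the sketch assumes one token per RVQ layer per frame.

```python
import math

# Figures taken from the report.
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # tokenizer output frame rate


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Number of raw audio samples summarized by one tokenizer frame."""
    return sample_rate_hz / frame_rate_hz


def tokens_for_duration(seconds, num_rvq_layers, frame_rate_hz=FRAME_RATE_HZ):
    """Token count for a clip, assuming one token per RVQ layer per frame.

    num_rvq_layers is a hypothetical parameter: with variable-bitrate RVQ,
    using fewer layers lowers the bitrate and the token count proportionally.
    """
    frames = math.ceil(seconds * frame_rate_hz)
    return frames * num_rvq_layers


if __name__ == "__main__":
    print(samples_per_frame())          # samples covered by one frame
    print(tokens_for_duration(10, 8))   # 10 s clip with an assumed 8 layers
```

Under these assumptions each frame covers 1920 samples (80 ms of audio), so sequence length grows slowly with clip duration, which is consistent with the long-form generation emphasis in the report.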