Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through a minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data, processed entirely with open-source tools. Our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that uses Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest word error rate (WER), 1.50%, on test-en, a character error rate (CER) of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
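For readers unfamiliar with the intelligibility metrics quoted above, the following is a minimal sketch of how a WER or CER figure is typically computed: the edit distance between reference and recognized text, summed over the corpus and divided by the total reference length. It is an illustration only and not the Seed-TTS Eval implementation. It assumes that transcripts of the synthesized audio have already been produced by an ASR model and normalized, and the names `edit_distance` and `error_rate` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits / total reference tokens.

    unit="word" splits on whitespace (WER, e.g. for English);
    unit="char" compares characters with spaces removed (CER, e.g. for Chinese).
    Text normalization is assumed to have been applied beforehand.
    """
    if unit not in ("word", "char"):
        raise ValueError("unit must be 'word' or 'char'")
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_len


if __name__ == "__main__":
    # Toy transcripts invented for illustration, not benchmark data.
    wer = error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
    cer = error_rate(["今天天气很好"], ["今天天汽很好"], unit="char")
    print(f"WER = {wer:.2%}, CER = {cer:.2%}")
```

Because the rate is edits divided by reference length, it can exceed 100% when the hypothesis contains many insertions. Summing over the corpus, as done here, is one common convention; averaging per-utterance rates would give a different number.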