This paper introduces SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient large language model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate (WER)-based layer importance index to remove non-essential Transformer layers with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
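As a rough illustration of the pruning criterion, a WER-based layer importance index can be sketched as follows: ablate one layer at a time, transcribe the resulting synthesized speech, and score each layer by the WER degradation it causes. This is a minimal sketch under stated assumptions — the ablation procedure, function names, and data are illustrative, not the paper's actual implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(r), 1)


def rank_layers_for_pruning(reference: str, ablation_transcripts: list[str]) -> list[int]:
    """Rank layers by importance: ablation_transcripts[i] is the (hypothetical)
    ASR transcript of speech synthesized with layer i removed. Layers whose
    removal barely raises WER come first, i.e. are the best pruning candidates."""
    scores = {i: wer(reference, hyp) for i, hyp in enumerate(ablation_transcripts)}
    return sorted(scores, key=scores.get)
```

In a real pipeline, the transcripts would come from an ASR model applied to the pruned model's synthesized audio; here they are passed in directly to keep the sketch self-contained.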