When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

From arXiv: We fine-tune the Qwen 0.5B backbone of an LLM-based TTS system with LoRA to raise MOS, speaker similarity, and SNR. This works best with diverse training audio; with uniform data it can amplify noise, so tune decoding and use GGUF quantization for low-latency, stable quality.

Large language models are increasingly adopted as semantic backbones for neural text-to-speech (TTS) systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with SNR increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for better speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, with low latency when the model is hosted in quantized GGUF form.
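For readers unfamiliar with LoRA, the following is a minimal sketch of the standard low-rank update it applies to a frozen weight matrix, W' = W + (alpha / r) * B * A. The abstract does not state the rank, scaling factor, or which layers were adapted, so the sizes and values below (`d_out`, `d_in`, `r`, `alpha`, and the 1024-wide layer in the parameter count) are purely illustrative assumptions, and the helper names are hypothetical rather than taken from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight a LoRA-adapted layer applies."""
    r = len(a)                 # rank: A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(b, a)       # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Parameters trained by LoRA versus full fine-tuning for one layer."""
    return {"lora": r * (d_in + d_out), "full": d_in * d_out}

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4.0   # toy sizes, NOT the paper's settings
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weight
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, small random init
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init

# With B = 0 the adapted layer is identical to the frozen base layer.
assert lora_effective_weight(w, a, b, alpha) == w

# Pretend one training step changed a single entry of B:
# only the corresponding output row of the effective weight moves.
b[0][0] = 0.5
w_adapted = lora_effective_weight(w, a, b, alpha)

# Hypothetical 1024 x 1024 layer with rank 8.
print(trainable_params(1024, 1024, 8))  # {'lora': 16384, 'full': 1048576}
```

Because B starts at zero, the adapted model begins as an exact copy of the base model, and the base weights themselves are never modified. Only the small A and B matrices are trained, which is what makes the approach parameter-efficient.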
