We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length (for speaking rate) and mean F0 (for pitch) as style rewards and word error rate (WER) as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
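As a rough illustration of what linear LoRA arithmetic could look like, the sketch below adds scaled low-rank updates to a frozen weight matrix. It is a minimal toy under stated assumptions, not the GLASS implementation: the helper names (`matmul`, `lora_delta`, `compose`), the 2x2 shapes, the rank-1 adapters, and the per-adapter coefficients are all hypothetical, and the abstract does not specify how adapters are scaled or merged.

```python
# Hypothetical sketch of linear LoRA arithmetic. All names, shapes, and the
# scaling scheme are illustrative assumptions, not taken from GLASS.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(B, A):
    """Weight update of one adapter: delta_W = B @ A."""
    return matmul(B, A)

def compose(W0, adapters, coeffs):
    """Return W0 + sum_i coeffs[i] * (B_i @ A_i), leaving W0 untouched."""
    W = [row[:] for row in W0]  # copy so the frozen weight is not modified
    for (B, A), c in zip(adapters, coeffs):
        delta = lora_delta(B, A)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                W[i][j] += c * v
    return W

# Toy frozen weight and two rank-1 adapters, one per assumed control axis.
W0 = [[1.0, 0.0], [0.0, 1.0]]
rate_adapter = ([[1.0], [0.0]], [[0.0, 2.0]])   # (B, A)
pitch_adapter = ([[0.0], [1.0]], [[4.0, 0.0]])  # (B, A)

# Interpolation: apply the rate adapter at half strength.
half_rate = compose(W0, [rate_adapter], [0.5])

# Multi-axis composition: full rate adapter plus half-strength pitch adapter.
both = compose(W0, [rate_adapter, pitch_adapter], [1.0, 0.5])
```

In this toy setting, a coefficient of 0 recovers the frozen weight and intermediate coefficients move linearly toward the full adapter. Whether that weight-space linearity yields a smooth change in the audible style is an empirical question, which the abstract addresses through its interpolation and composition experiments.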