While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.