With the development of speech synthesis, recent research has focused on challenging tasks such as speaker generation and emotion intensity control. Attribute interpolation is a common approach to these tasks. However, most previous attribute interpolation methods require task-specific modules or training procedures. We propose an attribute interpolation method for speech synthesis based on model merging. Model merging creates new parameters simply by averaging the parameters of base models; the merged model can then generate output with features intermediate between those of the base models. Because it relies only on existing trained base models, this approach is easy to apply without any special modules or training procedures. We merged two text-to-speech models to achieve attribute interpolation and evaluated the method on speaker generation and emotion intensity control tasks. As a result, our proposed method achieved smooth attribute interpolation while preserving the linguistic content in both tasks.
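The core operation described above, averaging the parameters of two trained base models, can be sketched as a weighted interpolation over their parameter dictionaries. This is a minimal illustrative sketch, not the paper's implementation: the function name `merge_state_dicts` and the interpolation weight `alpha` are assumptions, and plain floats stand in for model tensors.

```python
def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two models' parameters:
    merged = (1 - alpha) * A + alpha * B.
    With alpha = 0.5 this reduces to plain parameter averaging.
    Both models must share the same architecture (identical keys)."""
    assert sd_a.keys() == sd_b.keys(), "base models must share an architecture"
    return {k: (1.0 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

# Toy usage: floats stand in for the tensors of two base TTS models.
params_a = {"w": 0.0, "b": 2.0}
params_b = {"w": 1.0, "b": 4.0}
merged = merge_state_dicts(params_a, params_b, alpha=0.5)
# merged["w"] == 0.5, merged["b"] == 3.0
```

Sweeping `alpha` from 0 to 1 is what yields the smooth attribute interpolation between the two base models' characteristics.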