Contemporary conversational systems share a significant limitation: their responses lack the emotional depth and disfluencies characteristic of human interaction. The absence is most noticeable when users seek personalized, empathetic exchanges, and it makes these systems seem mechanical and less relatable. To close this gap and humanize machine communication, so that AI systems not only comprehend but also resonate, we design a speech synthesis pipeline. Within this framework, a language model introduces both human-like emotion and disfluencies in a zero-shot setting. The language model integrates these elements directly into the text as it is generated, allowing the system to mirror human speech patterns more closely and supporting more intuitive and natural user interactions. During the text-to-speech phase, a rule-based approach then transforms the generated elements into corresponding speech patterns and emotive sounds. In our experiments, the system produces synthesized speech that is almost indistinguishable from genuine human communication, making each interaction feel more personal and authentic.
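As a rough illustration of the rule-based text-to-speech step, the sketch below maps emotion and disfluency markers in generated text to SSML elements. The marker conventions (bracketed tags such as `[laugh]`, filler words such as "uh", and "..." for hesitation), the clip file names, and the pause durations are all assumptions made for this example; the abstract does not specify the actual rules or output format.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical conventions: the abstract does not define the marker format.
FILLERS = {"uh", "um", "er", "hmm"}
EMOTIVE_SOUNDS = {"laugh": "laugh.wav", "sigh": "sigh.wav"}  # assumed clip names

# Order matters: bracketed tag, ellipsis, word (with optional apostrophe), punctuation.
TOKEN_RE = re.compile(r"\[(\w+)\]|(\.\.\.)|(\w+(?:'\w+)?)|([^\w\s])")


def to_ssml(text: str) -> str:
    """Convert marked-up text into an SSML string using fixed rules."""
    parts = []
    for match in TOKEN_RE.finditer(text):
        tag, ellipsis, word, punct = match.groups()
        if tag is not None:
            # Emotive tag: play a pre-recorded sound; unknown tags are dropped.
            clip = EMOTIVE_SOUNDS.get(tag.lower())
            if clip:
                parts.append(f'<audio src="{clip}"/>')
        elif ellipsis:
            # Hesitation: insert a longer pause.
            parts.append('<break time="400ms"/>')
        elif word:
            if word.lower() in FILLERS:
                # Filler word: slow it down and follow it with a short pause.
                parts.append(
                    f'<prosody rate="slow">{escape(word)}</prosody>'
                    '<break time="200ms"/>'
                )
            else:
                parts.append(escape(word))
        else:
            # Punctuation attaches to the preceding part (no space before it).
            if parts:
                parts[-1] += escape(punct)
            else:
                parts.append(escape(punct))
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("Well, uh... [laugh] that's fine."))
```

Keeping the rules in plain lookup tables means new fillers or emotive sounds can be added without touching the conversion logic, which is one plausible reason to prefer a rule-based stage after a zero-shot language model.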