In this study, we address the importance of modeling behavior style in virtual agents for personalized human-agent interaction. We propose a machine learning approach that synthesizes gestures, driven by prosodic features and text, in the style of different speakers, including speakers unseen during training. Our model performs zero-shot multimodal style transfer using multimodal data from the PATS database, which contains videos of diverse speakers. We view style as pervasive during speech: it influences the expressivity of communicative behaviors, while content is conveyed through multimodal signals and text. By disentangling content and style, we can infer the style embedding directly, even for speakers not included in the training phase, without additional training or fine-tuning. We conduct objective and subjective evaluations to validate our approach and compare it against two baseline methods.