This paper addresses the challenge of transferring the behavior expressivity style of one virtual agent to another while preserving the shape of the behaviors, since shape carries communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker in the style of a target speaker. We assume that behavior expressivity style is encoded across several modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and generalizes to styles not seen during training. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art models in style transfer for styles both seen and unseen during training. To address the style and content leakage that may arise, we propose a methodology for assessing the degree to which behaviors and gestures associated with the target style are successfully transferred while those related to the source content are preserved.