Generating facial reactions in a human-human dyadic interaction is complex and highly context-dependent, because more than one facial reaction can be appropriate for a given speaker behaviour. This has challenged existing machine learning (ML) methods, whose training strategies force models to reproduce a single specific facial reaction, rather than multiple reactions, for each input speaker behaviour. This paper proposes the first multiple appropriate facial reaction generation framework, which re-formulates the one-to-many mapping facial reaction generation problem as a one-to-one mapping problem. That is, instead of generating multiple different appropriate facial reactions, we generate a distribution of the listener's appropriate facial reactions: the 'many' appropriate facial reaction labels are summarised as 'one' distribution label during training. Our model consists of a perceptual processor, a cognitive processor, and a motor processor. The motor processor is implemented with a novel Reversible Multi-dimensional Edge Graph Neural Network (REGNN). This allows us to obtain a distribution of appropriate real facial reactions during training, so that the cognitive processor can be trained to predict the appropriate facial reaction distribution. At the inference stage, the REGNN decodes an appropriate facial reaction from this predicted distribution. Experimental results demonstrate that our approach outperforms existing models, generating more appropriate, realistic, and synchronised facial reactions. The improved performance is largely attributed to the proposed appropriate facial reaction distribution learning strategy and the use of a REGNN. The code is available at https://github.com/TongXu-05/REGNN-Multiple-Appropriate-Facial-Reaction-Generation.
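To make the one-to-many to one-to-one idea concrete, the following is a minimal sketch and not the REGNN from the paper. It assumes a toy additive coupling layer as the reversible mapping and a per-dimension Gaussian as the "one" distribution label; the names `forward`, `inverse`, `summarise`, and `decode`, the weights, and the 6-dimensional feature vectors are all hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def shift(x1, weights):
    """Hypothetical shift function f(x1); it is never inverted, so any function works."""
    return [math.tanh(sum(w * v for w, v in zip(row, x1))) for row in weights]


def forward(x, weights):
    """Additive coupling: keep the first half, shift the second half by f(first half)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + s for b, s in zip(x2, shift(x1, weights))]


def inverse(y, weights):
    """Exact inverse of forward(): subtract the same shift from the second half."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [b - s for b, s in zip(y2, shift(y1, weights))]


def summarise(reactions, weights):
    """Map 'many' reactions to latents and summarise them as 'one' Gaussian label."""
    latents = [forward(r, weights) for r in reactions]
    columns = list(zip(*latents))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return means, stds


def decode(means, stds, weights, rng):
    """Sample one latent from the distribution and decode it with the inverse pass."""
    z = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return inverse(z, weights)


rng = random.Random(0)
dim, half = 6, 3
# Assumed random weights for the toy shift function (no training in this sketch).
weights = [[rng.uniform(-0.5, 0.5) for _ in range(half)] for _ in range(dim - half)]
# Four hypothetical appropriate reactions to the same speaker behaviour.
reactions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]

means, stds = summarise(reactions, weights)   # the single distribution label
sample = decode(means, stds, weights, rng)    # one decoded reaction
```

In this sketch the distribution label is computed directly from the data. In the framework described above, a cognitive processor would instead be trained to predict such a distribution from the speaker's behaviour, and the reversible network would decode a reaction from it at inference time.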