The objective of the Multiple Appropriate Facial Reaction Generation (MAFRG) task is to produce contextually appropriate and diverse listener facial behavioural responses based on the multimodal behavioural data of the conversational partner (i.e., the speaker). Current methodologies typically assume that speech and facial modality data are continuously available, neglecting real-world scenarios where these data may be intermittently unavailable, which often causes model failures. Furthermore, despite employing advanced deep learning models to extract information from the speaker's multimodal inputs, these methods fail to adequately leverage the speaker's emotional context, which is vital for eliciting appropriate facial reactions from human listeners. To address these limitations, we propose an Emotion-aware Modality Compensatory (EMC) framework. This versatile solution can be seamlessly integrated into existing models, preserving their advantages while significantly enhancing performance and robustness under missing-modality conditions. Our framework ensures resilience to missing modality data through the Compensatory Modality Alignment (CMA) module, and generates more appropriate emotion-aware reactions via the Emotion-aware Attention (EA) module, which incorporates the speaker's emotional information throughout the entire encoding and decoding process. Experimental results demonstrate that our framework improves the appropriateness metric FRCorr by an average of 57.2\% over the original model structures. When the speech modality is missing, appropriateness performance even improves; when facial data is missing, it degrades only minimally.