Engagement estimation plays a crucial role in understanding human social behaviors, attracting increasing research interest in fields such as affective computing and human-computer interaction. In this paper, we propose a Dialogue-Aware Transformer framework (DAT) with Modality-Group Fusion (MGF) for estimating human engagement in conversations; the framework relies solely on audio-visual input and is language-independent. Specifically, our method employs a modality-group fusion strategy that independently fuses the audio and visual features within each person's modality group before modeling the full audio-visual context. This strategy significantly enhances the model's performance and robustness. Additionally, to better estimate the target participant's engagement level, the introduced Dialogue-Aware Transformer considers both the participant's own behavior and cues from their conversational partners. Our method was rigorously evaluated in the Multi-Domain Engagement Estimation Challenge held at MultiMediate'24, demonstrating notable improvements in engagement-level regression precision over the baseline model. Notably, our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
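To make the two ideas concrete, the sketch below illustrates modality-group fusion (fusing each person's audio and visual features before joint modeling) and a dialogue-aware transformer that attends to both the target participant and a conversational partner. It is a minimal PyTorch sketch under assumed settings: all module names, feature dimensions, layer counts, and the cross-attention layout are illustrative assumptions, not the exact architecture described in the paper.

```python
# Minimal sketch of modality-group fusion + dialogue-aware cross-attention.
# Dimensions and structure are assumptions for illustration only.
import torch
import torch.nn as nn


class ModalityGroupFusion(nn.Module):
    """Fuse one person's audio and visual features into a single sequence."""
    def __init__(self, audio_dim: int, visual_dim: int, d_model: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (batch, time, audio_dim), visual: (batch, time, visual_dim)
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        return self.fuse(torch.cat([a, v], dim=-1))  # (batch, time, d_model)


class DialogueAwareTransformer(nn.Module):
    """Regress the target's engagement from their own fused features plus
    cues injected from the partner's fused features via cross-attention."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.self_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # frame-level engagement score

    def forward(self, target: torch.Tensor, partner: torch.Tensor) -> torch.Tensor:
        h = self.self_enc(target)                      # model target behavior
        ctx, _ = self.cross_attn(h, partner, partner)  # attend to partner cues
        return self.head(h + ctx).squeeze(-1)          # (batch, time)


if __name__ == "__main__":
    mgf = ModalityGroupFusion(audio_dim=128, visual_dim=512, d_model=256)
    dat = DialogueAwareTransformer(d_model=256)
    target = mgf(torch.randn(2, 100, 128), torch.randn(2, 100, 512))
    partner = mgf(torch.randn(2, 100, 128), torch.randn(2, 100, 512))
    print(dat(target, partner).shape)  # torch.Size([2, 100])
```

The per-person fusion step mirrors the modality-group idea of combining audio and visual streams before any cross-participant reasoning; the cross-attention stage is one plausible way to let partner behavior inform the target's engagement estimate.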