Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal emotion recognition capabilities, integrating visual, acoustic, and linguistic cues from video to recognize human emotional states. However, existing methods neither capture the local facial features that carry the temporal dynamics of micro-expressions nor exploit the contextual dependencies among utterance-aware temporal segments in the video, which limits their effectiveness. In this work, we propose MicroEmo, a time-sensitive MLLM that directs attention to local facial micro-expression dynamics and the contextual dependencies of utterance-aware video clips. Our model incorporates two key architectural contributions: (1) a global-local attention visual encoder that integrates global, frame-level, timestamp-bound image features with local facial features capturing the temporal dynamics of micro-expressions; (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video, and then combining them. Preliminary qualitative experiments on the new Explainable Multimodal Emotion Recognition (EMER) task, which exploits multimodal and multi-faceted clues to predict emotions in an open-vocabulary (OV) manner, demonstrate that MicroEmo is effective compared with the latest methods.
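To make the first contribution concrete, the following is a minimal sketch of how global frame-level features could be fused with local facial features via cross-attention. It assumes both feature streams have already been extracted (e.g., by a frozen CLIP-style image encoder and a face-crop pipeline); the module name `GlobalLocalAttention`, the dimensions, and the single-layer design are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a global-local attention fusion module (not the paper's code).
# Assumes per-frame global features and per-frame facial-crop features are precomputed.
import torch
import torch.nn as nn


class GlobalLocalAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Global frame features attend to local facial features so that
        # micro-expression dynamics can modulate the frame-level representation.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feats: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feats: (batch, num_frames, dim) frame-level features, timestamps bound upstream
        # local_feats:  (batch, num_frames, dim) features of cropped facial regions per frame
        attended, _ = self.cross_attn(query=global_feats, key=local_feats, value=local_feats)
        # Residual fusion keeps the global context while injecting local facial dynamics.
        return self.norm(global_feats + attended)


if __name__ == "__main__":
    encoder = GlobalLocalAttention()
    g = torch.randn(2, 16, 768)   # 2 clips, 16 sampled frames each
    l = torch.randn(2, 16, 768)   # matching facial-crop features
    fused = encoder(g, l)
    print(fused.shape)            # torch.Size([2, 16, 768])
```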
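The second contribution can be sketched similarly. A real Q-Former (as in BLIP-2-style models) is a multi-layer transformer; here a single cross-attention layer with learned query tokens stands in, and the utterance boundaries are assumed to come from upstream timestamps. The names `SimpleQFormer` and `utterance_aware_tokens` are hypothetical.

```python
# Illustrative sketch of utterance-aware visual token generation (not the paper's code).
import torch
import torch.nn as nn


class SimpleQFormer(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # Learned query tokens compress a variable-length frame sequence into a fixed set of visual tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) -> (batch, num_queries, dim) visual tokens
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        tokens, _ = self.cross_attn(query=q, key=frame_feats, value=frame_feats)
        return tokens


def utterance_aware_tokens(qformer: SimpleQFormer,
                           frame_feats: torch.Tensor,
                           segments: list[tuple[int, int]]) -> torch.Tensor:
    """Generate visual tokens per utterance segment and for the whole video, then combine them."""
    per_utterance = [qformer(frame_feats[:, s:e]) for s, e in segments]
    whole_video = qformer(frame_feats)
    # Concatenating segment-level and video-level tokens yields a multi-scale sequence
    # that preserves contextual dependencies across utterances.
    return torch.cat(per_utterance + [whole_video], dim=1)


if __name__ == "__main__":
    qf = SimpleQFormer()
    feats = torch.randn(1, 48, 768)       # 48 frames from one video
    segs = [(0, 16), (16, 32), (32, 48)]  # hypothetical utterance boundaries
    out = utterance_aware_tokens(qf, feats, segs)
    print(out.shape)                      # torch.Size([1, 128, 768]) = 4 x 32 queries
```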