Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions that strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. Moreover, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
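To make the cross-modal interaction concrete, the core operation underlying a mechanism like DDCA can be sketched as single-head cross-attention, where queries come from one modality and keys/values from the other, applied in both directions. This is a minimal illustrative sketch, not the paper's implementation: the function names, feature dimensions, and the single-head simplification are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Single-head cross-attention: queries from one modality,
    keys/values from the other (a simplification of the dual
    cross-attention idea; no learned projections)."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))  # (n_q, n_kv)
    return attn @ kv_feats                             # (n_q, d)

# Toy example: 5 AU-specific visual features and 5 textual AU
# description embeddings (dimensions are illustrative only).
rng = np.random.default_rng(0)
vis = rng.standard_normal((5, 16))  # vision-based AU features
txt = rng.standard_normal((5, 16))  # language-based AU features

# "Dual" direction: vision attends to text, and text attends to vision.
vis_enriched = cross_attention(vis, txt)
txt_enriched = cross_attention(txt, vis)
print(vis_enriched.shape, txt_enriched.shape)
```

Each row of the attention matrix is a convex combination over the other modality's features, so every AU-specific visual feature is enriched by the textual descriptions it aligns with (and vice versa); the paper's full mechanisms additionally disentangle per-AU interactions (DDCA) and model global inter-AU context (CDCA).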