Faithful Action-unit Causal Reasoning for Counterfactually Faithful Emotion Explanations

Multimodal models can name the action units (AUs) behind a facial emotion, but their AU->emotion rationales are typically plausible rather than faithful: nothing forces the AUs a model invokes to be the AUs that actually drive its prediction. We cast AU->emotion reasoning as a counterfactual-consistency problem between the rationale, the label, and a structural AU->emotion causal graph G, and propose FACR, which grounds the reasoner in an independently induced, polarity-aware G and trains a counterfactual-faithfulness objective: a do-intervention on an AU that G marks as causal for a class must move the prediction, while an intervention on an AU that G marks as irrelevant must leave it unchanged. Faithfulness is thereby both trainable and measurable through a matching interventional metric, which we evaluate against a known causal structure, the PSPI pain-AU composition, something no existing affective-reasoning benchmark allows. We are explicit that this metric tests fidelity to the supplied structure rather than its rediscovery: it asks whether the trained reasoner invokes the AUs the structure marks as causal, on held-out subjects and on a second dataset. Under subject-independent evaluation on UNBC-PAIN, the objective raises the agreement between the invoked AUs and the PSPI composition from a no-objective baseline of 0.08 to 0.57, at a small detection cost; an unfaithfulness control attributes the gain to the objective. In a cross-dataset emotion transfer, the objective likewise raises fidelity to G on a seven-class task (0.50 to 0.84). Finally, we attach a language verbalizer and extend the audit to the generated text: biasing each action unit's emission by its latent activation makes the rationale faithful by construction, so that ablating an AU removes it from the explanation. This property transfers to a second language-model backbone, whereas a freely generated rationale is unfaithful.
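To make the interventional metric concrete, the following is a minimal sketch, not the FACR implementation. It assumes a toy linear-softmax reasoner, a do-intervention that clamps one AU to zero, a fixed threshold `eps` for deciding that a prediction "moved", and Jaccard overlap as the agreement score; all of these, along with every function name, are illustrative assumptions that the abstract does not specify. The only external fact used is the standard PSPI composition, PSPI = AU4 + max(AU6, AU7) + max(AU9, AU10) + AU43.

```python
import math

# AUs that enter the PSPI score: AU4 + max(AU6, AU7) + max(AU9, AU10) + AU43.
PSPI_AUS = {"AU4", "AU6", "AU7", "AU9", "AU10", "AU43"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, aus):
    """Hypothetical toy reasoner: one weight dict per class, linear in AU activations."""
    logits = [
        sum(w.get(name, 0.0) * value for name, value in aus.items())
        for w in weights
    ]
    return softmax(logits)


def do_ablate(aus, name):
    """do-intervention (assumed form): clamp a single AU activation to zero."""
    out = dict(aus)
    out[name] = 0.0
    return out


def invoked_aus(weights, aus, target, eps=0.05):
    """AUs whose ablation moves the target-class probability by more than eps."""
    base = predict(weights, aus)[target]
    return {
        name
        for name in aus
        if abs(base - predict(weights, do_ablate(aus, name))[target]) > eps
    }


def agreement(invoked, causal):
    """Jaccard overlap between invoked AUs and AUs the structure marks causal (assumed metric)."""
    union = invoked | causal
    return len(invoked & causal) / len(union) if union else 1.0


# Two classes: index 0 = no pain, index 1 = pain. AU12 is not part of PSPI.
aus = {"AU4": 1.0, "AU6": 1.0, "AU12": 1.0}
causal = PSPI_AUS & set(aus)

faithful = [{}, {"AU4": 2.0, "AU6": 1.5}]   # relies on PSPI AUs only
unfaithful = [{}, {"AU12": 3.0}]            # relies on a non-PSPI AU

faithful_score = agreement(invoked_aus(faithful, aus, target=1), causal)
unfaithful_score = agreement(invoked_aus(unfaithful, aus, target=1), causal)
```

In this toy setting the reasoner that depends only on AU4 and AU6 agrees fully with the PSPI structure, while the one that depends on AU12 has zero agreement. The numbers reported in the abstract come from the authors' own metric and models, which may differ from this sketch.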
