The automated detection of sexism in memes is a challenging task due to multimodal ambiguity, cultural nuance, and the use of humor to provide plausible deniability. Content-only models often fail to capture the complexity of human perception. To address this limitation, we introduce and validate a human-centered paradigm that augments standard content features with physiological data. We created a novel resource by recording Eye-Tracking (ET), Heart Rate (HR), and Electroencephalography (EEG) from 16 subjects (8 per experiment) while they viewed 3984 memes from the EXIST 2025 dataset. Our statistical analysis reveals significant physiological differences in how subjects process sexist versus non-sexist content. Sexist memes were associated with higher cognitive load, reflected in increased fixation counts and longer reaction times, as well as differences in EEG spectral power across the Alpha, Beta, and Gamma bands, suggesting more demanding neural processing. Building on these findings, we propose a multimodal fusion model that integrates physiological signals with enriched textual-visual features derived from a Vision-Language Model (VLM). Our final model achieves an AUC of 0.794 in binary sexism detection, a statistically significant 3.4% improvement over a strong VLM-based baseline. The fusion is particularly effective for nuanced cases, boosting the F1-score for the most challenging fine-grained category, Misogyny and Non-Sexual Violence, by 26.3%. These results show that physiological responses provide an objective signal of perception that enhances the accuracy and human-awareness of automated systems for countering online sexism.