Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head motion, and other socially relevant cues. While audiovisual speech systems typically focus on the mouth region as the primary visual source of linguistic information, affective facial expressions are often treated separately as emotion-recognition targets. This paper investigates whether upper-face affective information contributes to audiovisual sentence recognition beyond audio and mouth-region cues, particularly under acoustic degradation. Using the CREMA-D audiovisual emotional speech corpus, we train feature-based sentence classifiers under four cue conditions: audio only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). Models are evaluated on clean audio and pink-noise conditions at +10 dB, +5 dB, and 0 dB SNR using actor-independent splits. Results show that mouth/lower-face features provide substantial robustness benefits under degraded audio. At 0 dB SNR, A+M improves accuracy over A by 0.0794, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective cues exhibit a more nuanced effect. Although the direct accuracy gain of A+M+U over A+M is small, full-face models consistently improve calibration across SNR levels and outperform shuffled upper-face controls under noisy conditions. These findings suggest that affective facial information may support multimodal robustness and confidence estimation under acoustic uncertainty without directly encoding lexical content. More broadly, the study highlights the potential role of socially expressive facial cues in human-centered audiovisual interaction systems.
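The actor-bootstrap confidence interval reported for the A+M versus A comparison can be illustrated with a minimal sketch. The abstract does not specify the resampling procedure, so this sketch assumes a percentile bootstrap that resamples test actors with replacement and recomputes the paired accuracy difference over all clips of the sampled actors. The function name `actor_bootstrap_ci`, its parameters, and the toy inputs are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def actor_bootstrap_ci(actors, correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(B) - accuracy(A), resampling actors (assumed procedure).

    actors    : actor ID for each test clip
    correct_a : 1/0 (or True/False) per clip for the baseline model, e.g. A
    correct_b : 1/0 (or True/False) per clip for the compared model, e.g. A+M
    """
    # Aggregate paired outcomes per actor: [n_clips, n_correct_a, n_correct_b].
    per_actor = defaultdict(lambda: [0, 0, 0])
    for actor, a, b in zip(actors, correct_a, correct_b):
        stats = per_actor[actor]
        stats[0] += 1
        stats[1] += int(a)
        stats[2] += int(b)
    groups = list(per_actor.values())

    def accuracy_gain(sample):
        # Pooled accuracy difference over every clip of the sampled actors.
        n_clips = sum(s[0] for s in sample)
        return (sum(s[2] for s in sample) - sum(s[1] for s in sample)) / n_clips

    rng = random.Random(seed)
    # Resample whole actors, so clips from one actor stay together.
    diffs = sorted(
        accuracy_gain(rng.choices(groups, k=len(groups))) for _ in range(n_boot)
    )
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return accuracy_gain(groups), lower, upper


if __name__ == "__main__":
    # Toy data only: three hypothetical actors with two clips each.
    toy_actors = ["a1", "a1", "a2", "a2", "a3", "a3"]
    toy_a = [1, 0, 0, 0, 1, 1]
    toy_b = [1, 1, 1, 0, 1, 1]
    print(actor_bootstrap_ci(toy_actors, toy_a, toy_b))
```

Resampling at the actor level rather than the clip level keeps each actor's clips together, which matches the actor-independent evaluation described above and avoids treating clips from the same speaker as independent observations.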
