We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
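The two computational steps named in the abstract, the post-hoc Russell Circumplex projection and the Spearman rank correlation, can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the study: the projection is taken to be a probability-weighted mean of per-category anchor points, the anchor coordinates in `CIRCUMPLEX` are illustrative placeholders, and the function names `project_to_circumplex`, `average_ranks`, and `spearman_rho` are hypothetical. The sketch computes rho only and does not compute p-values.

```python
from math import sqrt

# Hypothetical (valence, arousal) anchors in [-1, 1] for a few emotion categories.
# These coordinates are illustrative assumptions, not the values used in the study.
CIRCUMPLEX = {
    "angry": (-0.6, 0.8),
    "happy": (0.8, 0.5),
    "neutral": (0.0, 0.0),
    "sad": (-0.7, -0.5),
}


def project_to_circumplex(probs):
    """Map categorical emotion probabilities to one (valence, arousal) point.

    Assumed scheme: probability-weighted mean of the anchor points,
    renormalised over the categories that have an anchor.
    """
    total = sum(probs.get(label, 0.0) for label in CIRCUMPLEX)
    if total == 0:
        return (0.0, 0.0)  # no mass on known categories: fall back to the origin
    valence = sum(probs.get(label, 0.0) * v for label, (v, _) in CIRCUMPLEX.items())
    arousal = sum(probs.get(label, 0.0) * a for label, (_, a) in CIRCUMPLEX.items())
    return (valence / total, arousal / total)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (sx * sy)
```

In the setting described above, each of the 51 segments would contribute one projected Valence value per model and one TRUST-Pathos score, and `spearman_rho` would be applied to the two resulting per-segment sequences. Rank correlation is a natural choice here because the scores come from differently scaled sources, so only their ordering is compared.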