Research increasingly leverages audio-visual materials to analyze emotions in political communication. Multimodal large language models (mLLMs) promise to enable such analyses through in-context learning. However, we lack systematic evidence on whether these models can reliably measure emotions in real-world political settings. This paper evaluates leading mLLMs for video-based emotional arousal measurement using two complementary human-labeled video datasets: recordings created under laboratory conditions and real-world parliamentary debates. I find a critical lab-vs-field performance gap. In videos recorded under laboratory conditions, mLLMs' arousal scores approach human-level reliability with little to no demographic bias. In parliamentary debate recordings, however, all examined models' arousal scores correlate at best moderately with average human ratings and exhibit systematic bias by speaker gender and age. Neither relying on leading closed-source mLLMs nor applying computational noise mitigation strategies changes this finding. Further, mLLMs underperform even in sentiment analysis when given video recordings instead of text transcripts of the same speeches. These findings reveal important limitations of current mLLMs for real-world political video analysis and establish a rigorous evaluation framework for tracking future developments.