Can out-of-the-box pretrained Large Language Models (LLMs) successfully detect human affect when observing a video? To address this question, we evaluate comprehensively, for the first time, the capacity of popular LLMs to predict continuous affect annotations of videos when prompted with a sequence of text and video frames in a multimodal fashion. In this paper, we test LLMs' ability to correctly label changes in in-game engagement across 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 4,800 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground-truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains and are able to outperform traditional machine learning baselines, they generally fall behind continuous experience annotations provided by humans. We examine some of the underlying causes of their fluctuating performance across games, highlight the cases where LLMs exceed expectations, and draw a roadmap for the further exploration of automated emotion labelling via LLMs.