Multimodal large language models (MLLMs) have shown strong performance on objective tasks such as video understanding and reasoning. However, it remains unclear whether they can approximate subjective human responses, which depend not only on content comprehension but also on individuals' social contexts. To address this gap, we evaluate MLLMs as synthetic participants in an emerging task: assessing perceived sensory engagement with short videos. Grounded in the Perceived Message Sensation Value (PMSV) framework, we compare ratings from recruited human participants and profile-conditioned MLLM simulations (n=673) using a 17-item scale measuring emotional arousal, dramatic impact, and novelty. We find that even leading MLLMs (Gemini 3 Flash and Qwen 3 Omni) show limited agreement with human participants. The models exhibit distinct downward mean-shift and central-tendency biases in their rating distributions. They both introduce and flatten subgroup differences, while showing inconsistent sensitivity to participant profiles. Prompting strategies affect these metrics differently, modestly improving some aspects while worsening others. These results highlight both the challenges and opportunities of developing MLLMs as synthetic participants in video-based research. Data and code: https://github.com/MINDLab25/mllm-human-simulation-eval
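The two distributional biases named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code (see the linked repository for that). It assumes mean shift is measured as the model mean minus the human mean, and uses the ratio of standard deviations as one simple proxy for central-tendency bias. The function names, the 1-7 rating range, and the ratings themselves are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def mean_shift(human, model):
    """Model mean minus human mean; a negative value means the model rates lower."""
    return mean(model) - mean(human)


def spread_ratio(human, model):
    """Model SD divided by human SD; a value below 1 suggests ratings
    compressed toward the middle of the scale (an assumed proxy for
    central-tendency bias, not necessarily the paper's metric)."""
    return pstdev(model) / pstdev(human)


# Made-up ratings on an assumed 1-7 scale (NOT data from the study).
human_ratings = [2, 3, 5, 6, 7, 4, 6, 1]
model_ratings = [3, 3, 4, 4, 4, 3, 4, 3]

shift = mean_shift(human_ratings, model_ratings)
ratio = spread_ratio(human_ratings, model_ratings)

print(f"mean shift: {shift:+.2f}")
print(f"spread ratio: {ratio:.2f}")
```

In this toy example the model ratings are both lower on average and far less dispersed than the human ratings, which is the pattern the abstract describes as a downward mean shift combined with central-tendency bias.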