Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision--language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception--understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
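For readers unfamiliar with the per-game majority-class baseline mentioned above, the following is a minimal sketch of how such a baseline could be computed for the pointwise task. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `per_game_majority_baseline`, the `(game, label)` tuple format, and the `"high"`/`"low"` labels are all hypothetical, since the abstract does not specify how the data or engagement labels are represented.

```python
from collections import Counter, defaultdict


def per_game_majority_baseline(train, test):
    """Accuracy of always predicting each game's most frequent training label.

    `train` and `test` are lists of (game, label) pairs. This data layout is
    an assumption made for illustration only.
    """
    # Count how often each label occurs within each game in the training split.
    counts = defaultdict(Counter)
    for game, label in train:
        counts[game][label] += 1

    # The baseline prediction for a game is its most common training label.
    majority = {game: c.most_common(1)[0][0] for game, c in counts.items()}

    # Score the constant per-game prediction on the test split.
    # Games never seen in training get no prediction and count as errors.
    correct = sum(1 for game, label in test if majority.get(game) == label)
    return correct / len(test) if test else 0.0


# Hypothetical toy data: two games with binary engagement labels.
train = [("A", "high"), ("A", "high"), ("A", "low"),
         ("B", "low"), ("B", "low"), ("B", "high")]
test = [("A", "high"), ("A", "low"), ("B", "low"), ("B", "low")]
accuracy = per_game_majority_baseline(train, test)
```

A model that fails to exceed this accuracy has not demonstrably learned more than each game's label distribution, which is why the comparison is a useful sanity check for the zero-shot results.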