Natural head rotation is critical for believable embodied virtual agents, yet this micro-level behavior remains largely underexplored. While head-rotation prediction algorithms could, in principle, reproduce this behavior, they typically focus on visually salient stimuli and overlook the cognitive motives that guide head rotation. This yields agents that look at conspicuous objects while overlooking obstacles or task-relevant cues, diminishing realism in virtual environments. We introduce SCORE (Symbolic Cognitive Reasoning for Embodied Head Rotation), a data-agnostic framework that produces context-aware head movements without task-specific training or hand-tuned heuristics. A controlled VR study (N=20) identifies five motivational drivers of human head movements: Interest, Information Seeking, Safety, Social Schema, and Habit. SCORE encodes these drivers as symbolic predicates, perceives the scene with a Vision-Language Model (VLM), and plans head poses with a Large Language Model (LLM). The framework employs a hybrid workflow: the VLM-LLM reasoning is executed offline, after which a lightweight FastVLM performs online validation to suppress hallucinations while maintaining responsiveness to scene dynamics. The result is an agent that predicts not only where to look but also why, generalizing to unseen scenes and multi-agent crowds while retaining behavioral plausibility.
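The hybrid workflow described above can be illustrated with a minimal sketch. All names here (`plan_offline`, `validate_online`, `HeadPose`, the toy geometry) are hypothetical stand-ins, not the paper's actual API: the point is only the structure — symbolic driver predicates feed an offline planner, and a lightweight online check vetoes poses whose justifying cue is no longer present in the live scene.

```python
# Illustrative sketch of SCORE's hybrid offline-planning / online-validation
# workflow. Names and geometry are invented for this example.
from dataclasses import dataclass
from typing import Callable, Dict, List

# The five motivational drivers identified in the VR study (N=20),
# used here as symbolic predicate labels.
DRIVERS = ("Interest", "InformationSeeking", "Safety", "SocialSchema", "Habit")


@dataclass
class HeadPose:
    yaw_deg: float    # horizontal head rotation
    pitch_deg: float  # vertical head rotation
    driver: str       # which motivational predicate motivated this pose
    rationale: str    # the "why" behind the glance


def plan_offline(scene_facts: Dict[str, List[str]]) -> List[HeadPose]:
    """Stand-in for the offline VLM->LLM stage: map symbolic scene facts
    (driver -> observed cues) to candidate head poses."""
    poses = []
    for driver in DRIVERS:
        for cue in scene_facts.get(driver, []):
            # Toy geometry: derive a plausible yaw from the cue name.
            # A real planner would reason over 3D object positions.
            yaw = (hash(cue) % 120) - 60.0
            poses.append(HeadPose(yaw, 0.0, driver,
                                  f"{driver}: attend to {cue}"))
    # Safety-driven glances outrank habitual ones in this toy ordering.
    priority = {d: i for i, d in enumerate(
        ("Safety", "InformationSeeking", "SocialSchema", "Interest", "Habit"))}
    return sorted(poses, key=lambda p: priority[p.driver])


def validate_online(pose: HeadPose,
                    still_visible: Callable[[str], bool]) -> bool:
    """Stand-in for the lightweight FastVLM check: reject poses whose
    justifying cue has vanished from the live scene (hallucination
    suppression while staying responsive to scene dynamics)."""
    cue = pose.rationale.split("attend to ")[-1]
    return still_visible(cue)


# Usage: plan against a static scene description, then filter online.
scene = {"Safety": ["oncoming cart"], "Interest": ["bright poster"]}
plan = plan_offline(scene)
live = lambda cue: cue != "bright poster"  # poster removed at runtime
executed = [p for p in plan if validate_online(p, live)]
print([p.driver for p in executed])  # the safety glance survives validation
```

The design choice this sketch mirrors is the division of labor the abstract describes: expensive open-ended reasoning runs once, offline, while the per-frame online step is a cheap visibility check against the plan's own stated rationale.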