CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation

Crowds do not merely move; they decide. Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges (limited agent-centric supervision in crowd datasets, unstable per-frame control, and success-biased datasets) through: (i) agent-centric visual supervision via semantically reconstructed environments and Low-Rank Adaptation (LoRA) fine-tuning of a pretrained vision-language model, (ii) a motion skill action space that bridges symbolic decision making and continuous locomotion, and (iii) exploration-based question answering that exposes agents to counterfactual actions and their outcomes through simulation rollouts. Our results shift crowd simulation from motion-centric synthesis toward perception-driven, consequence-aware decision making, enabling crowds that move not just realistically, but meaningfully.

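
To make the idea of a motion skill action space more concrete, the following is a minimal sketch of how a symbolic skill choice might be expanded into a short continuous trajectory. The abstract does not specify the skill vocabulary or the locomotion model, so everything here is an assumption made for illustration: the `MotionSkill` class, the skill names in `SKILLS`, the unicycle kinematics, and the `rollout` function are hypothetical and are not taken from CrowdVLA itself.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionSkill:
    """Hypothetical skill: a speed and turn rate held for a fixed duration."""
    name: str
    speed: float      # metres per second
    turn_rate: float  # radians per second, positive turns left
    duration: float   # seconds


# Illustrative vocabulary only; the paper's actual skill set is not given here.
SKILLS = {
    "walk_forward": MotionSkill("walk_forward", 1.3, 0.0, 1.0),
    "veer_left": MotionSkill("veer_left", 1.3, math.pi / 8, 1.0),
    "veer_right": MotionSkill("veer_right", 1.3, -math.pi / 8, 1.0),
    "stop": MotionSkill("stop", 0.0, 0.0, 1.0),
}


def rollout(state, skill_name, dt=0.1):
    """Expand one symbolic skill into (x, y, heading) waypoints.

    Uses simple unicycle kinematics with Euler integration (an assumption).
    """
    x, y, heading = state
    skill = SKILLS[skill_name]
    steps = round(skill.duration / dt)
    path = []
    for _ in range(steps):
        heading += skill.turn_rate * dt
        x += skill.speed * math.cos(heading) * dt
        y += skill.speed * math.sin(heading) * dt
        path.append((x, y, heading))
    return path


if __name__ == "__main__":
    # One decision from the agent yields a whole second of motion.
    for waypoint in rollout((0.0, 0.0, 0.0), "veer_left"):
        print(waypoint)
```

In a design like this, the agent makes one decision per skill rather than one per frame, which is one plausible way to avoid the unstable per-frame control the abstract mentions. Rolling out every skill from the same state would also give a simple set of counterfactual outcomes to compare.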