Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future-collision signals in their internal representations, but that behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage, annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual-saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and in real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%. These results show that pretrained VLA policies can be adapted for safer social navigation by teaching them to act on representations they already possess.
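The abstract does not say how the future-risk supervision in stage (2) is generated. As a rough illustration only, one annotation-free way to derive such labels from logged trajectories is to mark a timestep as risky when the robot comes close to a pedestrian within a short look-ahead window. The function name `future_risk_labels`, the `horizon` and `radius` parameters, and the input layout are all assumptions made for this sketch, not details from the paper.

```python
import math


def future_risk_labels(robot_xy, pedestrian_xy, horizon=3, radius=0.5):
    """Hypothetical future-risk labelling (not the paper's procedure).

    robot_xy:      list of (x, y) robot positions, one per timestep.
    pedestrian_xy: list (same length) of lists of (x, y) pedestrian positions.
    Returns a list of 0/1 labels; label t is 1 if the robot is closer than
    `radius` to any pedestrian at some step in (t, t + horizon].
    """
    # Minimum robot-pedestrian distance at each step (inf if nobody is present).
    min_dist = []
    for (rx, ry), peds in zip(robot_xy, pedestrian_xy):
        dists = [math.hypot(rx - px, ry - py) for px, py in peds]
        min_dist.append(min(dists, default=math.inf))

    labels = []
    for t in range(len(min_dist)):
        # Look only at future steps, so the label is anticipatory.
        future = min_dist[t + 1 : t + 1 + horizon]
        labels.append(1 if any(d < radius for d in future) else 0)
    return labels


if __name__ == "__main__":
    # Robot drives along the x-axis past a pedestrian standing at (3.0, 0.2).
    robot = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
    peds = [[(3.0, 0.2)] for _ in robot]
    print(future_risk_labels(robot, peds, horizon=2, radius=0.5))
```

Because the labels come from positions already present in the logs, no manual annotation is needed. The look-ahead window controls how early the policy is asked to anticipate a close approach.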