Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future-collision signals in their internal representations, but that behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage, annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual-saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and in real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%. These results show that pretrained VLA policies can be adapted for safer social navigation by aligning action generation with the latent representations they already possess.