As AI systems increasingly assist humans with physical tasks, ensuring safety becomes paramount: physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when it predicts a dangerous action. VLESA addresses intent-dependent safety, where identical actions can be safe or dangerous depending on context. We introduce a dataset that pairs egocentric frames with goal-conditioned safety annotations and use it to train a goal-conditioned safety Q-filter via GRPO; the filter evaluates actions with respect to the inferred intent without retraining. We further propose an intent-action prediction agent that jointly infers goals and predicts future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame than the baselines, and the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.
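To make the idea of goal-conditioned constrained decoding concrete, the following is a minimal sketch, not the released VLESA implementation. It assumes a scoring function that maps a (goal, action) pair to a safety score in [0, 1] and keeps only candidate actions whose score clears a threshold. The names `toy_q_filter`, `constrained_decode`, `TOY_SAFETY_SCORES`, the threshold of 0.5, and all scores are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a learned safety Q-filter. The scores are toy
# values invented for this sketch; they are not taken from the paper.
TOY_SAFETY_SCORES: Dict[Tuple[str, str], float] = {
    ("chop vegetables", "pick up knife"): 0.9,
    ("wipe counter", "pick up knife"): 0.2,
    ("chop vegetables", "touch hot pan"): 0.1,
    ("wipe counter", "pick up cloth"): 0.95,
}


def toy_q_filter(goal: str, action: str) -> float:
    """Return a safety score in [0, 1]; unknown pairs default to 0.5."""
    return TOY_SAFETY_SCORES.get((goal, action), 0.5)


def constrained_decode(
    goal: str,
    candidates: List[Tuple[str, float]],
    q_filter: Callable[[str, str], float] = toy_q_filter,
    threshold: float = 0.5,
) -> Optional[str]:
    """Pick the most likely candidate action that is safe under `goal`.

    `candidates` holds (action, likelihood) pairs. Returns None when no
    candidate clears the threshold, which a caller could treat as a
    signal to intervene.
    """
    safe = [(a, p) for a, p in candidates if q_filter(goal, a) >= threshold]
    if not safe:
        return None
    return max(safe, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    candidates = [("pick up knife", 0.7), ("pick up cloth", 0.3)]
    # The same candidate set yields different choices under different goals.
    print(constrained_decode("chop vegetables", candidates))
    print(constrained_decode("wipe counter", candidates))
```

In this sketch the same action ("pick up knife") is accepted under one goal and rejected under another, which mirrors the intent-dependent notion of safety described in the abstract. Because the goal is an input to the scoring function, changing the goal does not require changing the scorer itself.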