SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We use SafeManip to evaluate six vision-language-action policies, including $π_0$, $π_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, and longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
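
To make the monitoring idea concrete, the following is a minimal sketch of how a temporal safety template might be checked against a finite predicate trace. It is not SafeManip's implementation: the predicate names (`gripper_contaminated`, `touch_clean_surface`, `release`, `object_inside`), the function names, and the two formulas are assumptions chosen to mirror the abstract's two examples. The trace is taken to be a list of per-step dictionaries from predicate name to truth value, and the operators are evaluated directly by recursion over the trace rather than compiled to automata.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]            # one time step: predicate name -> truth value
Trace = List[State]                # a finite rollout as a symbolic predicate trace
Formula = Callable[[Trace, int], bool]

def atom(name: str) -> Formula:
    """Atomic predicate; a predicate missing from a step is treated as False."""
    return lambda trace, i: trace[i].get(name, False)

def always(trace: Trace, i: int, phi: Formula) -> bool:
    """G phi over a finite trace: phi holds at every step from i to the end."""
    return all(phi(trace, j) for j in range(i, len(trace)))

def until(trace: Trace, i: int, phi: Formula, psi: Formula) -> bool:
    """phi U psi (strong): psi holds at some j >= i and phi holds at all steps before j."""
    for j in range(i, len(trace)):
        if psi(trace, j):
            return True
        if not phi(trace, j):
            return False
    return False

# Hypothetical template 1 (cross-contamination):
#   G(contaminated -> G !touch_clean)
def no_touch_after_contamination(
    trace: Trace,
    contaminated: str = "gripper_contaminated",
    touch: str = "touch_clean_surface",
) -> bool:
    c, t = atom(contaminated), atom(touch)
    no_touch = lambda tr, j: not t(tr, j)
    return always(trace, 0, lambda tr, i: (not c(tr, i)) or always(tr, i, no_touch))

# Hypothetical template 2 (release inside an enclosure), as a weak until:
#   (!release) W inside  ==  G(!release) or ((!release) U inside)
def no_release_before_inside(
    trace: Trace,
    release: str = "release",
    inside: str = "object_inside",
) -> bool:
    r, ins = atom(release), atom(inside)
    not_r = lambda tr, i: not r(tr, i)
    return always(trace, 0, not_r) or until(trace, 0, not_r, ins)

# A rollout that completes the task but releases the object too early.
early_release: Trace = [
    {"release": False, "object_inside": False},
    {"release": True, "object_inside": False},
    {"release": True, "object_inside": True},
]
is_safe = no_release_before_inside(early_release)
```

In this sketch the template parameters (the predicate names passed as arguments) play the role of the task-specific objects, fixtures, or regions mentioned above: the same formula can be re-instantiated for a different task by passing different predicate names. A production monitor would more likely compile the formula to an automaton; direct recursive evaluation is used here only to keep the example short.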
