SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $π_0$, $π_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
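To make the monitoring pipeline concrete, the following is a minimal sketch of how a finite symbolic predicate trace might be checked against instantiated temporal safety templates, and how safe success might be counted. It is not the SafeManip implementation. The predicate names (`release(...)`, `inside(...)`, `touch(...)`, `gripper_contaminated`), the two template functions, the `Rollout` type, and the definition of safe success as "task succeeded and every property holds" are all assumptions made for illustration. The evaluator uses the standard finite-trace semantics of the "always" and "eventually" operators directly rather than compiling formulas to automata.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, bool]                  # one time step: predicate name -> truth value
Trace = List[State]                      # finite symbolic predicate trace
Formula = Callable[[Trace, int], bool]   # does the formula hold at position i?


def atom(name: str) -> Formula:
    # A predicate missing from a state is treated as false (assumption).
    return lambda tr, i: tr[i].get(name, False)


def neg(f: Formula) -> Formula:
    return lambda tr, i: not f(tr, i)


def implies(f: Formula, g: Formula) -> Formula:
    return lambda tr, i: (not f(tr, i)) or g(tr, i)


def always(f: Formula) -> Formula:
    # G f: f holds at every remaining position of the finite trace.
    return lambda tr, i: all(f(tr, j) for j in range(i, len(tr)))


def eventually(f: Formula) -> Formula:
    # F f: f holds at some remaining position of the finite trace.
    return lambda tr, i: any(f(tr, j) for j in range(i, len(tr)))


def holds(formula: Formula, trace: Trace) -> bool:
    # LTLf is interpreted over non-empty finite traces.
    if not trace:
        raise ValueError("trace must contain at least one state")
    return formula(trace, 0)


# Hypothetical reusable templates, instantiated with task-specific names.
def release_only_inside(obj: str, enclosure: str) -> Formula:
    # G( release(obj) -> inside(obj, enclosure) )
    return always(implies(atom(f"release({obj})"),
                          atom(f"inside({obj},{enclosure})")))


def no_touch_after_contamination(surface: str) -> Formula:
    # G( gripper_contaminated -> G !touch(surface) )
    return always(implies(atom("gripper_contaminated"),
                          always(neg(atom(f"touch({surface})")))))


@dataclass
class Rollout:
    trace: Trace
    success: bool   # task-level success flag reported by the environment


def safe_success_rate(rollouts: List[Rollout], properties: List[Formula]) -> float:
    # Fraction of rollouts that succeed AND satisfy every safety property.
    safe = [r for r in rollouts
            if r.success and all(holds(p, r.trace) for p in properties)]
    return len(safe) / len(rollouts)


# Example traces: both rollouts end with the mug inside the cabinet,
# but the second releases it before it is fully inside.
release_prop = release_only_inside("mug", "cabinet")
good_trace = [{}, {"inside(mug,cabinet)": True},
              {"release(mug)": True, "inside(mug,cabinet)": True}]
bad_trace = [{}, {"release(mug)": True},
             {"inside(mug,cabinet)": True}]

# Ordering matters: touching the counter before contamination is allowed,
# touching it afterwards is a violation.
contam_prop = no_touch_after_contamination("counter")
touch_then_contaminate = [{"touch(counter)": True}, {"gripper_contaminated": True}]
contaminate_then_touch = [{"gripper_contaminated": True}, {"touch(counter)": True}]

rollouts = [Rollout(good_trace, True), Rollout(bad_trace, True)]

if __name__ == "__main__":
    print(holds(release_prop, good_trace), holds(release_prop, bad_trace))
    print(holds(contam_prop, touch_then_contaminate),
          holds(contam_prop, contaminate_then_touch))
    print(safe_success_rate(rollouts, [release_prop]))
```

In this sketch both example rollouts count as task successes, yet only one satisfies the release property, which mirrors the distinction the abstract draws between task success and safe success. The same template functions can be re-instantiated with different object, fixture, or region names without changing the evaluator.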
