Video Instance Removal (VIR) requires removing target objects while preserving background integrity and physical consistency, including specular reflections and illumination interactions. Despite advances in text-guided editing, current benchmarks primarily assess visual plausibility and often overlook the physical side effects that object removal should also eliminate, such as lingering shadows. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, which comprises 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods (PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo) with a decoupled human evaluation protocol that isolates semantic, visual, and spatial failures along three dimensions: instruction following, rendering quality, and edit exclusivity. Our results show that PISCO-Removal and UniVideo achieve the best performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of handling complex physical side effects.