Despite remarkable progress in text-driven video editing, generating coherent non-rigid deformations remains a critical challenge, often plagued by physical distortion and temporal flicker. To bridge this gap, we propose NRVBench, the first dedicated and comprehensive benchmark for evaluating non-rigid video editing. First, we curate a high-quality dataset of 180 non-rigid motion videos spanning six physics-based categories, paired with 2,340 fine-grained task instructions and 360 multiple-choice questions. Second, we propose NRVE-Acc, a novel evaluation metric based on Vision-Language Models that rigorously assesses physical compliance, temporal consistency, and instruction alignment, overcoming the limitations of general-purpose metrics in capturing complex dynamics. Third, we introduce a training-free baseline, VM-Edit, which employs a dual-region denoising mechanism to achieve structure-aware control, balancing structural preservation against dynamic deformation. Extensive experiments show that existing methods struggle to maintain physical plausibility, whereas VM-Edit achieves strong performance on both standard and proposed metrics. We believe the benchmark can serve as a standard testing platform for advancing physics-aware video editing.
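The dual-region denoising idea can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes the common latent-blending pattern in which, at each reverse-diffusion step, the region targeted for deformation is denoised freely while the complementary region is overwritten with the source latent to preserve structure. All names (`dual_region_step`, `denoise`, the mask convention) are hypothetical.

```python
import numpy as np

def dual_region_step(z_edit, z_src, mask, denoise):
    """One denoising step with dual-region control (illustrative sketch).

    z_edit : latent currently being edited
    z_src  : source-video latent at the same noise level
    mask   : 1.0 inside the deformation region, 0.0 in the preserved region
    denoise: a single reverse-diffusion step (hypothetical callable)
    """
    z_edit = denoise(z_edit)  # free denoising drives the non-rigid edit
    # Outside the mask, copy back the source latent so the unedited
    # region keeps the original structure (structure-aware control).
    return mask * z_edit + (1.0 - mask) * z_src
```

The blend at the return line is what balances the two objectives: raising the mask coverage allows larger deformation, while the complementary term anchors the rest of the frame to the source and suppresses flicker.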