VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang, Zihan Wang, Qing Yin, Zhengzhong Tu

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing is now essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or on generic vision-language model (VLM) judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models, both on standard image and video quality assessment (IQA/VQA) metrics and in group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.
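
To make "per-dimension quality scores via ordinal regression" concrete, the following is a minimal sketch of one common formulation, the cumulative-link (threshold) model. The abstract does not say which ordinal formulation VEFX-Reward uses, so everything here is an assumption for illustration: the 5-level score scale, the threshold values, the latent values, and the function names `ordinal_decode` and `ordinal_loss` are all hypothetical. Only the three dimension names come from the abstract.

```python
import math

# Dimension names are from the abstract; everything else below is assumed.
DIMENSIONS = ("instruction_following", "rendering_quality", "edit_exclusivity")

# Hypothetical: 4 increasing thresholds define a 5-level scale (scores 1..5).
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_decode(latent: float, thresholds: list[float]) -> tuple[int, float]:
    """Decode one scalar latent into an ordinal score.

    Cumulative-link model: P(score > k) = sigmoid(latent - thresholds[k-1]).
    Returns (hard score, expected score), both on the scale 1..len(thresholds)+1.
    """
    exceed = [sigmoid(latent - t) for t in thresholds]
    hard = 1 + sum(p > 0.5 for p in exceed)   # count thresholds the latent passes
    expected = 1.0 + sum(exceed)              # E[score] = 1 + sum_k P(score > k)
    return hard, expected


def ordinal_loss(latent: float, thresholds: list[float], label: int) -> float:
    """Sum of binary cross-entropies over the K-1 'score > k' sub-problems."""
    loss = 0.0
    for k, t in enumerate(thresholds, start=1):
        target = 1.0 if label > k else 0.0
        p = sigmoid(latent - t)
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return loss


# Hypothetical latents, one per dimension, as a reward-model head might emit.
latents = {"instruction_following": 2.0, "rendering_quality": 0.0, "edit_exclusivity": -1.0}
scores = {dim: ordinal_decode(latents[dim], THRESHOLDS) for dim in DIMENSIONS}
```

The point of treating scores as ordinal rather than as unordered classes is that the loss penalizes a prediction more the further it lands from the human label, while making no assumption that adjacent score levels are equally spaced.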
