Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text-video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly generated regions that should be preserved rather than regenerated. Building on this observation, VideoRepair adopts a novel region-preserving refinement strategy with three stages: (i) misalignment detection, where an MLLM answers automatically generated evaluation questions to identify misaligned regions; (ii) refinement planning, which selects the correctly generated entities to preserve, segments their regions across frames, and constructs targeted prompts for the misaligned areas; and (iii) localized refinement, which selectively regenerates problematic regions while preserving faithful content through joint optimization of preserved and newly generated areas. On two benchmarks, EvalCrafter and T2V-CompBench, with four recent T2V backbones, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.
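As a rough illustration of the planning and compositing ideas described above, the following is a minimal sketch and not the authors' implementation. The names `QAResult`, `plan_refinement`, and `composite` are hypothetical, the MLLM verdicts are assumed to be given as booleans, and the hard per-pixel compositing stands in for the joint optimization of preserved and regenerated areas that the abstract mentions but does not specify.

```python
from dataclasses import dataclass


@dataclass
class QAResult:
    """One evaluation question and its (assumed) MLLM verdict."""
    entity: str    # object the question is about
    question: str  # automatically generated evaluation question
    passed: bool   # True if the video was judged aligned on this question


def plan_refinement(results):
    """Split entities into those to preserve and those to regenerate.

    An entity with at least one failed question is regenerated;
    only entities whose questions all passed are preserved.
    """
    failed = {r.entity for r in results if not r.passed}
    keep = sorted({r.entity for r in results if r.passed} - failed)
    redo = sorted(failed)
    return keep, redo


def composite(original, regenerated, keep_mask):
    """Blend one frame: keep masked pixels, take the rest from the new frame.

    Simplification: a hard mask replaces the joint optimization
    described in the abstract.
    """
    return [
        [o if m else r for o, r, m in zip(o_row, r_row, m_row)]
        for o_row, r_row, m_row in zip(original, regenerated, keep_mask)
    ]


if __name__ == "__main__":
    results = [
        QAResult("cat", "Is there a cat?", True),
        QAResult("dog", "Is the dog to the left of the cat?", False),
    ]
    keep, redo = plan_refinement(results)
    print(keep, redo)
```

In this toy setting, the preserved entities would be segmented across frames to build `keep_mask`, and the entities in `redo` would drive the targeted prompt for the regenerated region.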