Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often introduces background artifacts, whereas rigidly locking the background severely constrains the model's capacity for foreground generation. To address this tension, we propose KV-Lock, a training-free framework tailored to DiT-based video diffusion models. Our core insight is that the hallucination metric (the variance of the denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building on this, KV-Lock uses diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments show that our method surpasses existing approaches, improving foreground quality while preserving high background fidelity across diverse video editing tasks.
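To make the scheduling mechanism concrete, below is a minimal PyTorch sketch of the two hallucination-driven controls described above. All function names, tensor shapes, thresholds, and ranges (`hallucination_metric`, `schedule`, `fuse_kv`, `h_lo`, `h_hi`, the alpha and CFG ranges) are hypothetical illustrations, not the paper's actual implementation: the hallucination score is taken as the variance over a small ensemble of denoising predictions, and it is mapped monotonically to a background KV fusion ratio and a CFG scale.

```python
import torch

def hallucination_metric(eps_preds: torch.Tensor) -> torch.Tensor:
    """Hallucination score per sample: variance of denoising predictions.

    eps_preds: (S, B, C, T, H, W) — S stochastic predictions per sample
    (ensemble dim 0 is a hypothetical choice for estimating the variance).
    Returns a (B,) score; higher variance = higher hallucination risk.
    """
    return eps_preds.var(dim=0).mean(dim=(1, 2, 3, 4))

def schedule(h: torch.Tensor, h_lo: float = 0.05, h_hi: float = 0.5,
             alpha_range=(0.5, 1.0), cfg_range=(5.0, 9.0)):
    """Map hallucination score h to (KV fusion ratio alpha, CFG scale).

    Thresholds and ranges are illustrative placeholders. Higher risk ->
    stronger background KV locking (alpha -> 1) and stronger guidance.
    """
    t = ((h - h_lo) / (h_hi - h_lo)).clamp(0.0, 1.0)          # (B,) in [0, 1]
    alpha = alpha_range[0] + t * (alpha_range[1] - alpha_range[0])
    cfg = cfg_range[0] + t * (cfg_range[1] - cfg_range[0])
    return alpha, cfg

def fuse_kv(kv_cached: torch.Tensor, kv_new: torch.Tensor,
            alpha: torch.Tensor, bg_mask: torch.Tensor) -> torch.Tensor:
    """Blend cached background KVs with newly generated KVs.

    kv_cached / kv_new: (B, heads, tokens, dim) attention keys or values.
    bg_mask: (B, 1, tokens, 1) boolean mask marking background tokens.
    Background tokens get the convex blend; foreground tokens keep kv_new.
    """
    a = alpha.view(-1, 1, 1, 1)  # broadcast alpha over heads/tokens/dim
    return torch.where(bg_mask, a * kv_cached + (1 - a) * kv_new, kv_new)

def guided_eps(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
               cfg: torch.Tensor) -> torch.Tensor:
    """Standard classifier-free guidance with the scheduled per-sample scale."""
    w = cfg.view(-1, 1, 1, 1, 1)  # broadcast over (C, T, H, W)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In this sketch, both controls are driven by the same per-sample score at each denoising step: `fuse_kv` would be applied inside the DiT attention blocks at background token positions, while `guided_eps` replaces the fixed-scale CFG combination; a linear map from score to controls is only one plausible choice of scheduler.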