Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency. The project page is https://bridgeremoval.github.io/.
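To illustrate the difference between a noise-initialized path and a bridge path, the following is a minimal sketch of a standard Brownian bridge pinned at a source tensor and a target tensor. It is not the paper's implementation: the function name `brownian_bridge_sample`, the noise scale `sigma`, the time convention (t = 0 at the source video, t = 1 at the target video), and the toy array shapes are all assumptions made for illustration. The abstract does not specify the exact bridge process or how adaptive mask modulation enters it.

```python
import numpy as np

def brownian_bridge_sample(source, target, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at `source` (t=0) and `target` (t=1).

    Hypothetical helper: a textbook Brownian bridge, not the paper's model.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    # The mean interpolates linearly between the two endpoints.
    mean = (1.0 - t) * source + t * target
    # The standard deviation vanishes at both endpoints and peaks at t = 0.5.
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(source.shape)

# Toy stand-ins for video latents with shape (frames, height, width).
rng = np.random.default_rng(0)
source = rng.random((4, 8, 8))   # video with the object present (assumed)
target = rng.random((4, 8, 8))   # video with the object removed (assumed)

x_start = brownian_bridge_sample(source, target, 0.0, rng=rng)
x_mid = brownian_bridge_sample(source, target, 0.5, rng=rng)
x_end = brownian_bridge_sample(source, target, 1.0, rng=rng)
```

In this sketch the path starts exactly at the source video rather than at Gaussian noise, so every intermediate state carries the structure of the input. In a noise-to-data diffusion model, by contrast, the starting point carries no information about the scene.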