Humans can infer the complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video, given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from the Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at https://jason-aplp.github.io/TACO.
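As a rough illustration of the data-curation idea described above, the sketch below composites a synthetic occluder onto the frames of an un-occluded clip at a chosen coverage ratio. This is not the paper's actual pipeline: the `occlude_clip` function, the flat rectangular occluder, and the `coverage` parameterization of difficulty levels are all assumptions made purely for illustration.

```python
import numpy as np

def occlude_clip(frames: np.ndarray, coverage: float, seed: int = 0):
    """Impose a synthetic occluder on every frame of an un-occluded clip.

    frames:   (T, H, W, 3) uint8 video frames.
    coverage: fraction of the frame area hidden by the occluder;
              larger values emulate harder difficulty levels.
    Returns the occluded frames and per-frame visibility masks.
    """
    rng = np.random.default_rng(seed)
    T, H, W, _ = frames.shape
    # Size a square occluder so its area is roughly `coverage` of the frame.
    side = int(np.sqrt(coverage * H * W))
    y0 = int(rng.integers(0, max(H - side, 1)))
    x0 = int(rng.integers(0, max(W - side, 1)))
    occluded = frames.copy()
    masks = np.ones((T, H, W), dtype=bool)  # True where the frame is visible
    for t in range(T):
        # Drift the occluder a little each frame so occlusion varies in time.
        dy = (y0 + 2 * t) % max(H - side, 1)
        occluded[t, dy:dy + side, x0:x0 + side] = 128  # flat gray occluder
        masks[t, dy:dy + side, x0:x0 + side] = False
    return occluded, masks

# Example: three hypothetical difficulty levels for a progressive curriculum.
clip = np.zeros((16, 128, 128, 3), dtype=np.uint8)
for level, cov in enumerate([0.1, 0.3, 0.5], start=1):
    hard_clip, vis_mask = occlude_clip(clip, coverage=cov, seed=level)
```

In a realistic dataset the occluders would presumably be segmented real objects rather than flat rectangles; the sketch only shows how a difficulty-graded curriculum of occlusions can be parameterized, matching the abstract's simple-to-complex fine-tuning order.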