Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which is more robust on in-the-wild videos than the same model trained on existing matting datasets. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.
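To make the mask-to-matte interface concrete, below is a minimal, hypothetical sketch of how a model might consume a video clip plus per-frame coarse masks and emit soft alpha mattes. The abstract does not specify VideoMaMa's architecture or API; all names here are illustrative, and a tiny 3D convolutional network stands in for the pretrained video diffusion backbone so the example runs end to end.

```python
# Hypothetical sketch of a mask-to-matte interface (names are illustrative;
# the paper's actual architecture and API are not given in the abstract).
import torch
import torch.nn as nn


class MaskToMatte(nn.Module):
    """Stand-in for a mask-conditioned matting model: takes a video clip and
    a coarse binary mask per frame, returns a soft alpha matte per frame."""

    def __init__(self, channels: int = 16):
        super().__init__()
        # A real system would condition a pretrained video diffusion model;
        # this tiny 3D conv net is a placeholder so the sketch is runnable.
        self.net = nn.Sequential(
            nn.Conv3d(4, channels, kernel_size=3, padding=1),  # RGB + mask
            nn.ReLU(),
            nn.Conv3d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, T, H, W) in [0, 1]; masks: (B, 1, T, H, W) in {0, 1}
        x = torch.cat([frames, masks], dim=1)  # concatenate along channels
        return torch.sigmoid(self.net(x))      # alpha values in [0, 1]


if __name__ == "__main__":
    model = MaskToMatte()
    frames = torch.rand(1, 3, 8, 64, 64)                   # an 8-frame clip
    masks = (torch.rand(1, 1, 8, 64, 64) > 0.5).float()    # coarse masks
    alpha = model(frames, masks)
    print(alpha.shape)  # torch.Size([1, 1, 8, 64, 64])
```

In a pseudo-labeling pipeline of the kind the abstract describes, the output alpha mattes would serve as training targets for a downstream model such as SAM2-Matte.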