Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes that lack fine details. To address this, we introduce a learned Matting Quality Evaluator (MQE) that assesses the semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as online matting-quality feedback during training, suppressing erroneous regions and providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
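To make the online feedback role of the MQE concrete, below is a minimal sketch (not the paper's implementation) of how a per-pixel evaluation map could gate the matting loss during training, so that regions the evaluator marks as erroneous contribute no supervision. All names here (mqe_weighted_l1, alpha_pred, alpha_pseudo, eval_map) are illustrative assumptions; the same kind of map could also drive the offline selection step by choosing, per region, between outputs of a video and an image matting model.

```python
import torch
import torch.nn.functional as F

def mqe_weighted_l1(alpha_pred, alpha_pseudo, eval_map, threshold=0.5):
    """L1 matting loss masked by a per-pixel reliability map in [0, 1].

    alpha_pred:   (B, 1, H, W) predicted alpha matte
    alpha_pseudo: (B, 1, H, W) pseudo ground truth (e.g., from segmentation
                  or a teacher matting model)
    eval_map:     (B, 1, H, W) evaluator output; higher = more reliable
    """
    # Keep only pixels the evaluator judges reliable.
    reliable = (eval_map > threshold).float()
    per_pixel = F.l1_loss(alpha_pred, alpha_pseudo, reduction="none")
    # Average over reliable pixels only; clamp avoids division by zero
    # when an entire map is marked unreliable.
    return (per_pixel * reliable).sum() / reliable.sum().clamp(min=1.0)
```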