In this paper, we introduce semi-supervised video object segmentation (VOS) to panoptic wild scenes and present a large-scale benchmark as well as a baseline method for it. Previous VOS benchmarks, which have only sparse annotations, are not sufficient to train or evaluate a model that needs to process all possible objects in real-world scenarios. Our new benchmark (VIPOSeg) contains exhaustive object annotations and covers various real-world object categories, which are carefully divided into subsets of thing/stuff and seen/unseen classes for comprehensive evaluation. Considering the challenges in panoptic VOS, we propose a strong baseline method named panoptic object association with transformers (PAOT), which uses panoptic identification to associate objects with a pyramid architecture on multiple scales. Experimental results show that VIPOSeg can not only boost the performance of VOS models through panoptic training but also evaluate them comprehensively in panoptic scenes. Previous methods for classic VOS still fall short in performance and efficiency when dealing with panoptic scenes, while our PAOT achieves state-of-the-art (SOTA) performance with good efficiency on VIPOSeg and previous VOS benchmarks. PAOT also ranks 1st in the VOT2022 challenge. Our dataset is available at https://github.com/yoxu515/VIPOSeg-Benchmark.