MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

Video object segmentation (VOS) aims to segment a particular object throughout an entire video sequence. State-of-the-art VOS methods have achieved excellent performance (e.g., over 90% J&F) on existing datasets. However, because the target objects in these datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is its complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by other objects and disappear in some frames. To analyze MOSE, we benchmark 18 existing VOS methods under 4 different settings and conduct comprehensive comparisons. The experiments show that current VOS algorithms do not yet perceive objects well in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F achieved by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ~90% J&F on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes, and more effort is needed to explore these challenges in the future. The proposed MOSE dataset has been released at https://henghuiding.github.io/MOSE.
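The abstract reports J&F without defining it. In DAVIS-style evaluation, J is the region similarity (intersection over union between predicted and ground-truth masks), F is a boundary accuracy score, and J&F is their average. The sketch below illustrates only the J part and the final averaging; the function names are hypothetical, masks are assumed to be plain nested lists of 0/1 values, the boundary score F is taken as an already-computed input, and the handling of frames where the object is absent from both masks is an assumed convention rather than something stated in the abstract.

```python
def region_similarity(pred, gt):
    """Jaccard index (intersection over union) of two binary masks.

    Both masks are equally sized nested lists of 0/1 values (an assumption
    made for this sketch; real toolkits typically use arrays).
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: if the object is absent from both masks
    # (e.g., it has disappeared and the method predicts nothing),
    # the frame counts as a perfect match.
    if union == 0:
        return 1.0
    return inter / union


def mean_region_similarity(pred_frames, gt_frames):
    """Average J over all frames of one object track."""
    scores = [region_similarity(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(scores) / len(scores)


def j_and_f(j_mean, f_mean):
    """J&F is the plain average of mean J and mean F."""
    return (j_mean + f_mean) / 2.0


# Two tiny 2x2 frames: the object is half covered in frame 1 and has
# disappeared in frame 2, where the prediction still marks one pixel.
gt_frames = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
pred_frames = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]
j_mean = mean_region_similarity(pred_frames, gt_frames)
```

In this toy case a false positive on a frame where the object has disappeared yields J = 0 for that frame, which hints at why frequent disappearance and occlusion can pull the averaged score down.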
