The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks. We adopt a careful, multi-step approach to ensure that these videos focus on complex object transformations and capture their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues. This motivates us to propose a few modifications to the top-performing baseline that improve its capabilities by better modeling spatio-temporal information. More broadly, we hope to stimulate discussion on learning more robust video object representations.