Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce $\textbf{TCOW}$, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where, given a video sequence, the goal is to segment both the projected extent of the target object and the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets that supports both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that, while they can be surprisingly capable of tracking targets under certain settings of task variation, a considerable performance gap remains before we can claim that a tracking model has acquired a true notion of object permanence.
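To make the task's output concrete, the following is a minimal sketch of how a prediction could be scored against annotations. It assumes boolean masks of shape (T, H, W) for each output, and it uses intersection-over-union as the measure; the metric, the dictionary keys `target` and `container`, and the function names are illustrative assumptions, not details taken from the benchmark.

```python
import numpy as np


def mask_iou(pred, truth):
    """IoU between two boolean mask videos of shape (T, H, W)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    # Assumed convention: if neither mask marks any pixel (for example,
    # no container exists and none is predicted), count it as agreement.
    return 1.0 if union == 0 else float(intersection / union)


def score_prediction(pred, truth):
    """Score each output separately; keys are hypothetical."""
    return {name: mask_iou(pred[name], truth[name]) for name in truth}


# Toy example: 2 frames of 4x4 pixels.
T, H, W = 2, 4, 4
truth = {
    "target": np.zeros((T, H, W), dtype=bool),
    "container": np.zeros((T, H, W), dtype=bool),  # no container present
}
truth["target"][:, 1:3, 1:3] = True  # 2x2 target in every frame

pred = {
    "target": np.zeros((T, H, W), dtype=bool),
    "container": np.zeros((T, H, W), dtype=bool),
}
pred["target"][:, 1:3, 1:4] = True  # 2x3 guess that overshoots by one column

scores = score_prediction(pred, truth)
print(scores)
```

Scoring the target and the container or occluder separately keeps a model from hiding a poor target estimate behind an easy occluder mask, which matters most in frames where the target is fully hidden.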