Scalable Video Object Segmentation with Identification Mechanism

This paper addresses the challenge of scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, which limits the learning of multi-object representations because each target must be matched and segmented separately in multi-object scenarios. In addition, earlier techniques were designed for specific application objectives and lacked the flexibility to meet different speed-accuracy requirements. To address these problems, we present two approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). For effective multi-object modeling, AOT introduces the IDentification (ID) mechanism, which assigns each object a unique identity. This enables the network to model the associations among all objects simultaneously, so that all objects are tracked and segmented in a single network pass. To address inflexible deployment, AOST further integrates scalable long short-term transformers that incorporate layer-wise ID-based attention and scalable supervision. This overcomes the representation limitations of ID embeddings and enables online architecture scalability in VOS for the first time. Because no existing VOS benchmark provides dense multi-object annotations, we propose a challenging Video Object Segmentation in the Wild (VOSW) benchmark to validate our approaches. We evaluate various AOT and AOST variants in extensive experiments on VOSW and five commonly used VOS benchmarks. Our approaches surpass state-of-the-art competitors and consistently display exceptional efficiency and scalability across all six benchmarks. Moreover, we achieved 1st place in the 3rd Large-scale Video Object Segmentation Challenge.
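As a rough illustration of the ID mechanism described above, the following sketch assigns each object in a multi-object mask a distinct identity vector and builds a per-pixel ID embedding, so that all objects share one feature map. It is a minimal sketch under stated assumptions, not the authors' implementation: the fixed-size identity bank, the random slot assignment, the zero vector for background, and the names `assign_identities`, `embed_mask`, and `id_bank` are all hypothetical choices made for this example.

```python
import random


def assign_identities(object_ids, bank_size, rng):
    """Map each object label to a distinct slot in an identity bank (assumed design)."""
    if len(object_ids) > bank_size:
        raise ValueError("more objects than available identities")
    # Sampling without replacement guarantees that every object gets a unique slot.
    slots = rng.sample(range(bank_size), len(object_ids))
    return dict(zip(object_ids, slots))


def embed_mask(mask, assignment, id_bank, background):
    """Replace each pixel label with the identity vector assigned to that label."""
    return [
        [id_bank[assignment[label]] if label in assignment else background
         for label in row]
        for row in mask
    ]


# Toy example: a 2x3 mask with background (0) and two objects (1 and 2).
rng = random.Random(0)
bank_size, channels = 4, 3

# Hypothetical identity bank; in a trained network these vectors would be learned.
id_bank = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(bank_size)]
background = [0.0] * channels  # assumption: background pixels get a zero vector

mask = [[0, 1, 1],
        [2, 2, 0]]

assignment = assign_identities([1, 2], bank_size, rng)
id_embedding = embed_mask(mask, assignment, id_bank, background)
```

Because every object is encoded in the same `id_embedding` map, a single decoder pass could in principle attend to all objects at once instead of processing one positive object per pass.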
