Scalable Video Object Segmentation with Identification Mechanism

This paper addresses the challenge of scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, which limits the learning of multi-object representations because each target must be matched and segmented separately in multi-object scenarios. In addition, earlier methods were designed for specific application objectives and lacked the flexibility to meet different speed-accuracy requirements. To address these problems, we present two approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). For effective multi-object modeling, AOT introduces the IDentification (ID) mechanism, which assigns each object a unique identity. This enables the network to model the associations among all objects simultaneously, so that all objects are tracked and segmented in a single network pass. To address inflexible deployment, AOST further integrates scalable long short-term transformers that incorporate scalable supervision and layer-wise ID-based attention. This enables online architecture scalability in VOS for the first time and overcomes the representation limitations of ID embeddings. Given the absence of a VOS benchmark with dense multi-object annotations, we propose a challenging Video Object Segmentation in the Wild (VOSW) benchmark to validate our approaches. We evaluate various AOT and AOST variants in extensive experiments on VOSW and five commonly used VOS benchmarks: YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016. Our approaches surpass state-of-the-art competitors and consistently show strong efficiency and scalability across all six benchmarks. Project page: https://github.com/yoxu515/aot-benchmark.
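As a rough illustration of the ID mechanism described above, the following sketch gives every object in a multi-object mask a distinct identity vector, so that one embedding map carries all targets at once. The abstract does not specify the details: the fixed-size identity bank, the random assignment, the zero vector for background, and the names `assign_identities` and `embed_mask` are assumptions made for this example, not the released implementation.

```python
import random


def assign_identities(object_ids, bank_size, rng):
    """Give each object a distinct slot in a fixed-size identity bank (assumed design)."""
    if len(object_ids) > bank_size:
        raise ValueError("more objects than identities in the bank")
    # Sampling without replacement guarantees that identities are unique.
    slots = rng.sample(range(bank_size), len(object_ids))
    return dict(zip(object_ids, slots))


def embed_mask(mask, assignment, id_bank):
    """Map a label mask (0 = background) to a per-pixel identity embedding."""
    dim = len(id_bank[0])
    return [
        [
            # Background pixels get a zero vector (an assumption of this sketch).
            list(id_bank[assignment[label]]) if label in assignment else [0.0] * dim
            for label in row
        ]
        for row in mask
    ]


rng = random.Random(0)
bank_size, dim = 4, 3
# Hypothetical identity bank: in a trained model these vectors would be learned.
id_bank = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bank_size)]

# A 2x3 mask containing two objects (labels 1 and 2) and background (0).
mask = [[0, 1, 1],
        [2, 2, 0]]
assignment = assign_identities([1, 2], bank_size, rng)
embedding = embed_mask(mask, assignment, id_bank)
```

Because both objects live in the same embedding map, a single forward pass over `embedding` can in principle attend to all targets together, rather than running once per object.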
