Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and the identity confusion caused by occlusion, background clutter, and long-term variations. In this paper, we propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries to address these issues. Specifically, we construct a spatial-semantic network comprising a semantic embedding block and a spatial dependency modeling block, which associates pretrained ViT features with global semantic features and local spatial features to provide a comprehensive target representation. In addition, we develop a masked cross-attention module that generates object queries focusing on the most discriminative parts of the target objects during query propagation, alleviating noise accumulation and ensuring effective long-term query propagation. Experimental results show that the proposed method sets new state-of-the-art performance on multiple benchmarks, including the DAVIS 2017 test set (89.1%), YouTube-VOS 2019 (88.5%), MOSE (75.1%), the LVOS test set (73.0%), and the LVOS validation set (75.1%), demonstrating the effectiveness and generalization capability of the proposed method. We will make all source code and trained models publicly available.
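To make the masked cross-attention idea concrete, below is a minimal PyTorch sketch of how object queries can be restricted to attend only to predicted foreground regions during propagation. It is an illustration of the general mechanism under assumed shapes and names (MaskedCrossAttention, fg_mask, Nq, HW are hypothetical), not the paper's exact module.

```python
# Minimal sketch of masked cross-attention for object-query propagation.
# Assumed (hypothetical) shapes: queries (B, Nq, C), pixel features (B, HW, C),
# and a boolean foreground mask (B, Nq, HW) derived from the previous frame's prediction.
import torch
import torch.nn as nn


class MaskedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, feats, fg_mask):
        # fg_mask is True where attention is allowed (predicted foreground),
        # so each query only aggregates features from its target region.
        # nn.MultiheadAttention expects attn_mask with True = "do not attend",
        # broadcast per head to shape (B * num_heads, Nq, HW).
        block = ~fg_mask
        # Avoid fully masked rows (all background): fall back to full attention there.
        block = block & ~block.all(dim=-1, keepdim=True)
        block = block.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(queries, feats, feats, attn_mask=block)
        return self.norm(queries + out)  # residual update of the object queries


if __name__ == "__main__":
    B, Nq, HW, C = 2, 5, 256, 128
    q = torch.randn(B, Nq, C)
    f = torch.randn(B, HW, C)
    m = torch.rand(B, Nq, HW) > 0.5            # hypothetical foreground mask
    print(MaskedCrossAttention(C)(q, f, m).shape)  # torch.Size([2, 5, 128])
```

In this reading, the mask constrains query updates to discriminative target regions, which is one way the noise accumulated over long sequences can be limited during propagation.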