In this paper, we introduce Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges the context surrounding each instance with the core instance features to improve tracking accuracy. Additionally, we introduce the Prototypical Cross-frame Contrastive (PCC) loss, which enforces consistency of object-level features across frames and thereby significantly improves instance matching accuracy. CAVIS outperforms state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, which is known for its particularly challenging videos.
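To make the idea of cross-frame feature consistency concrete, the following is a minimal sketch of a generic prototype-based contrastive objective between two frames. It is not the PCC loss itself, whose exact formulation the abstract does not give. The sketch assumes that each instance is summarized per frame by the mean of its feature vectors (its "prototype"), that matching instances share the same index in both frames, and that an InfoNCE-style objective with a temperature is used. The function names `prototype` and `cross_frame_contrastive_loss` and the default temperature are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def prototype(features):
    """Hypothetical helper: unit-norm mean of one instance's feature vectors in one frame."""
    dim = len(features[0])
    count = len(features)
    mean = [sum(f[d] for f in features) / count for d in range(dim)]
    return l2_normalize(mean)


def cross_frame_contrastive_loss(protos_a, protos_b, temperature=0.1):
    """InfoNCE-style loss over prototypes from two frames (an assumed formulation).

    protos_a[i] and protos_b[i] are assumed to describe the same instance;
    every other prototype in protos_b acts as a negative for protos_a[i].
    """
    total = 0.0
    for i, anchor in enumerate(protos_a):
        # Cosine similarity (prototypes are unit-norm) scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, p)) / temperature
                  for p in protos_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching instance.
        total += log_denominator - logits[i]
    return total / len(protos_a)
```

Under these assumptions, the loss is small when each instance's prototype stays close to its own prototype in the other frame and far from those of other instances, which is the kind of cross-frame consistency that makes instance matching easier.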