In recent years, online Video Instance Segmentation (VIS) methods have advanced remarkably thanks to powerful query-based detectors. By using the detector's frame-level output queries, these methods achieve high accuracy on challenging benchmarks. However, we observe that they rely heavily on location information, which often leads to incorrect associations between objects. We argue that appearance information is a key axis of object matching in trackers, and that it becomes especially informative when positional cues are insufficient to distinguish object identities. We therefore propose a simple yet powerful extension to object decoders that explicitly extracts embeddings from backbone features and drives queries to capture object appearance, which greatly improves instance association accuracy. Furthermore, because existing benchmarks do not fully evaluate appearance awareness, we construct a synthetic dataset to rigorously validate our method. By effectively resolving the over-reliance on location information, we achieve state-of-the-art results on YouTube-VIS 2019/2021 and Occluded VIS (OVIS). Code is available at https://github.com/KimHanjung/VISAGE.
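To make the location-versus-appearance argument concrete, the following is a minimal illustrative sketch, not the implementation in the linked repository. It assumes each instance in a frame is summarized by a location-dominated embedding and a separate appearance embedding, and that association maximizes a weighted sum of cosine similarities. The names `associate`, `appearance_weight`, `prev_frame`, and `curr_frame` are hypothetical, the weighting scheme is an assumption, and brute-force permutation search stands in for whatever matcher a real tracker would use.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def associate(prev, curr, appearance_weight=0.5):
    """Match instances across two frames (hypothetical helper, illustration only).

    prev, curr: equal-length lists of (location_embedding, appearance_embedding).
    Returns a list m where m[i] is the index in curr matched to prev[i].
    Brute-force search over permutations; only suitable for a handful of instances.
    """
    n = len(prev)
    assert len(curr) == n, "this sketch assumes the same instance count per frame"

    def score(i, j):
        loc = cosine(prev[i][0], curr[j][0])
        app = cosine(prev[i][1], curr[j][1])
        return (1.0 - appearance_weight) * loc + appearance_weight * app

    best = max(
        permutations(range(n)),
        key=lambda p: sum(score(i, p[i]) for i in range(n)),
    )
    return list(best)


# Toy case: two objects swap positions between frames.
# Instance 0 in prev_frame has appearance [1, 0, 0]; instance 1 has [0, 1, 0].
prev_frame = [([1.0, 0.0], [1.0, 0.0, 0.0]),
              ([0.0, 1.0], [0.0, 1.0, 0.0])]
# In curr_frame, the slot near position 0 now holds the object with appearance [0, 1, 0].
curr_frame = [([0.9, 0.1], [0.0, 1.0, 0.0]),
              ([0.1, 0.9], [1.0, 0.0, 0.0])]

location_only = associate(prev_frame, curr_frame, appearance_weight=0.0)
with_appearance = associate(prev_frame, curr_frame, appearance_weight=0.5)
```

In this toy case, `location_only` is `[0, 1]`, which keeps identities tied to positions and therefore swaps the two objects, while `with_appearance` is `[1, 0]`, which follows each object to its new position.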