In recent years, online Video Instance Segmentation (VIS) methods have advanced remarkably, driven by powerful query-based detectors. By utilizing the detector's frame-level output queries, these methods achieve high accuracy on challenging benchmarks. However, we observe that they rely heavily on location information, which leads to incorrect matching when positional cues are insufficient to resolve ambiguities. To address this issue, we present VISAGE, which enhances instance association by explicitly leveraging appearance information. Our method generates queries that embed appearance from backbone feature maps, and these queries are then used in our simple tracker for robust association. By resolving the over-reliance on location information, VISAGE enables accurate matching in complex scenarios and achieves competitive performance on multiple VIS benchmarks. For instance, our method achieves 54.5 AP on YTVIS19 and 50.8 AP on YTVIS21. Furthermore, to highlight appearance-awareness, which existing benchmarks do not fully address, we generate a synthetic dataset on which our method significantly outperforms others by leveraging the appearance cue. Code will be made available at https://github.com/KimHanjung/VISAGE.
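To make the idea of appearance-aware association concrete, the following is a minimal sketch and not the authors' implementation. It assumes each instance carries two embeddings, a location-driven one (`loc`) and an appearance one (`app`), that similarity is the sum of their cosine similarities, and that matching is a one-to-one assignment maximizing total similarity. The names `cosine`, `associate`, `prev_frame`, and `curr_frame` are hypothetical, the toy vectors are invented for illustration, and the brute-force search over permutations stands in for a proper assignment solver.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def associate(prev, curr, use_appearance=True):
    """Match each previous instance to one current detection.

    Returns a list m where m[i] is the index in `curr` assigned to prev[i].
    Assumes len(curr) >= len(prev). The exhaustive search is only practical
    for a handful of instances (illustrative stand-in for an assignment solver).
    """
    sim = [
        [
            cosine(p["loc"], c["loc"])
            + (cosine(p["app"], c["app"]) if use_appearance else 0.0)
            for c in curr
        ]
        for p in prev
    ]
    n = len(prev)
    best = max(
        permutations(range(len(curr)), n),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Toy scenario (invented values): two instances cross paths, so each current
# detection sits where the *other* instance was in the previous frame.
prev_frame = [
    {"loc": [1.0, 0.0], "app": [1.0, 0.0]},  # instance 0
    {"loc": [0.0, 1.0], "app": [0.0, 1.0]},  # instance 1
]
curr_frame = [
    {"loc": [0.9, 0.1], "app": [0.0, 1.0]},  # looks like instance 1
    {"loc": [0.1, 0.9], "app": [1.0, 0.0]},  # looks like instance 0
]

location_only = associate(prev_frame, curr_frame, use_appearance=False)
with_appearance = associate(prev_frame, curr_frame, use_appearance=True)
print(location_only, with_appearance)
```

In this toy case, matching on location alone keeps each identity at its old position, whereas adding the appearance term swaps the assignment so that each identity follows the instance that looks like it.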