Handling long videos with complex and occluded sequences has recently emerged as a new challenge in the video instance segmentation (VIS) community, and existing methods address it only to a limited extent. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To bridge this gap, we propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks without designing complicated architectures or requiring extra post-processing. The key contribution of GenVIS is its learning strategy, which includes a query-based training pipeline for sequential learning with a novel target label assignment. Additionally, we introduce a memory that effectively acquires information from previous states. Because this perspective focuses on building relationships between separate frames or clips, GenVIS can be flexibly executed in both online and semi-online manners. We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS). Notably, we greatly outperform the state-of-the-art on the long VIS benchmark (OVIS), improving by 5.6 AP with a ResNet-50 backbone. Code is available at https://github.com/miranheo/GenVIS.
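The difference between online and semi-online execution can be illustrated with a minimal Python sketch. It is not the GenVIS implementation: the names `split_into_clips`, `run_sequentially`, and `toy_step` are hypothetical, the clips are assumed to be non-overlapping, and the carried `state` is only a stand-in for whatever information a model passes from one frame or clip to the next. A clip length of 1 corresponds to frame-by-frame (online) processing, and a longer clip length to semi-online processing.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple


def split_into_clips(frames: Sequence[Any], clip_len: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive, non-overlapping clips (an assumption of this sketch).

    clip_len == 1 gives frame-by-frame processing; clip_len > 1 gives clip-wise
    processing. The final clip may be shorter than clip_len.
    """
    if clip_len < 1:
        raise ValueError("clip_len must be >= 1")
    for start in range(0, len(frames), clip_len):
        yield frames[start:start + clip_len]


def run_sequentially(
    frames: Sequence[Any],
    clip_len: int,
    step: Callable[[Sequence[Any], Any], Tuple[Any, Any]],
    init_state: Any,
) -> List[Any]:
    """Process clips in order, feeding the state returned by each step into the next."""
    state = init_state
    outputs = []
    for clip in split_into_clips(frames, clip_len):
        out, state = step(clip, state)
        outputs.append(out)
    return outputs


def toy_step(clip: Sequence[Any], state: int) -> Tuple[Tuple[list, int], int]:
    """Hypothetical stand-in for a model step.

    The "prediction" is the clip together with the number of frames seen before
    it; the new state is the updated frame count.
    """
    return (list(clip), state), state + len(clip)


if __name__ == "__main__":
    video = list(range(5))                          # five dummy "frames"
    print(run_sequentially(video, 1, toy_step, 0))  # online: one frame per step
    print(run_sequentially(video, 2, toy_step, 0))  # semi-online: two frames per step
```

The same loop serves both modes; only `clip_len` changes, which mirrors the idea that a method relating separate frames or clips does not need a different procedure for each setting.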