In this paper, we present offline-to-online knowledge distillation (OOKD) for video instance segmentation (VIS), which transfers a wealth of video knowledge from an offline model to an online model for consistent prediction. Unlike previous methods that adopt either an online or an offline model, our single online model takes advantage of both by distilling offline knowledge. To transfer knowledge correctly, we propose query filtering and association (QFA), which filters out queries that are irrelevant to the exact instances. Our KD with QFA increases the robustness of feature matching by encoding object-centric features from a single frame, supplemented by long-range global information. We also propose a simple data augmentation scheme for knowledge distillation in the VIS task that fairly transfers the knowledge of all classes into the online model. Extensive experiments show that our method significantly improves performance in video instance segmentation, especially on challenging datasets that include long, dynamic sequences. Our method also achieves state-of-the-art performance on the YTVIS-21, YTVIS-22, and OVIS datasets, with mAP scores of 46.1%, 43.6%, and 31.1%, respectively.
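To make the idea of query filtering and association more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each query is a plain feature vector, that the offline (teacher) model provides a per-query confidence score, that filtering is a simple score threshold, that association is a greedy cosine-similarity match, and that the distillation loss is a mean squared error over matched pairs. The names `filter_and_associate`, `distillation_loss`, and `score_thresh` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def filter_and_associate(teacher_queries, teacher_scores, student_queries,
                         score_thresh=0.5):
    """Return (teacher_index, student_index) pairs.

    Assumption: teacher queries scoring below `score_thresh` are treated as
    irrelevant and dropped; each remaining teacher query is greedily matched
    to the most similar student query that is still unmatched.
    """
    kept = [i for i, s in enumerate(teacher_scores) if s >= score_thresh]
    pairs, used = [], set()
    for i in kept:
        best_j, best_sim = None, -2.0  # cosine similarity is always >= -1
        for j, q in enumerate(student_queries):
            if j in used:
                continue
            sim = cosine(teacher_queries[i], q)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def distillation_loss(teacher_queries, student_queries, pairs):
    """Mean squared error between matched teacher and student queries."""
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        t, s = teacher_queries[i], student_queries[j]
        total += sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
    return total / len(pairs)


# Toy example: the second teacher query has a low score and is filtered out.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.9, 0.2, 0.8]
student = [[0.9, 0.1], [1.0, 1.0], [0.0, 1.0]]

pairs = filter_and_associate(teacher, scores, student)
loss = distillation_loss(teacher, student, pairs)
```

In this toy setup only the confident teacher queries contribute to the loss, so the online (student) model is not pushed toward queries that match no instance. A one-to-one optimal assignment such as Hungarian matching could replace the greedy loop; the greedy version is used here only to keep the sketch dependency-free.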