Unlike incremental classification, incremental object detection suffers from data ambiguity: the same image may carry differently labeled bounding boxes across successive learning stages, which often impairs the model's ability to learn new classes. Existing research has paid little attention to the forward compatibility of the model, which limits its suitability for incremental learning. To address this, we propose using a vision-language model such as CLIP to generate text feature embeddings for the different class sets, which enhances the feature space globally. We then use super-classes as stand-ins for the novel classes that are unavailable in the early learning stage, thereby simulating the incremental scenario. Finally, we use the CLIP image encoder to identify potential objects accurately and incorporate the resulting refined detection boxes into training as pseudo-annotations, further improving detection performance. We evaluate our approach under various incremental learning settings on the PASCAL VOC 2007 dataset, where it outperforms state-of-the-art methods, particularly in recognizing new classes.
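As a rough illustration of the pseudo-annotation step, the sketch below keeps a candidate box only when its image embedding is close enough to one of the class text embeddings. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how boxes are scored, so cosine similarity and the `threshold` value are assumptions, the names `cosine` and `select_pseudo_annotations` are hypothetical, and short toy vectors stand in for the outputs of the CLIP image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_pseudo_annotations(box_embeddings, text_embeddings, threshold=0.8):
    """Keep candidate boxes whose best-matching class scores at least `threshold`.

    box_embeddings: list of (box, embedding) pairs, one per candidate region.
    text_embeddings: dict mapping class name -> text embedding.
    threshold: hypothetical similarity cut-off (an assumption, not from the paper).
    Returns a list of (box, class_name, score) tuples.
    """
    if not text_embeddings:
        return []
    pseudo = []
    for box, emb in box_embeddings:
        # Score the region against every class and keep the best match.
        best_name, best_score = max(
            ((name, cosine(emb, t)) for name, t in text_embeddings.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            pseudo.append((box, best_name, best_score))
    return pseudo


# Toy 2-D vectors standing in for CLIP text and image embeddings.
toy_text = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
toy_boxes = [
    ((0, 0, 10, 10), [0.9, 0.1]),   # clearly closest to "cat"
    ((5, 5, 20, 20), [0.5, 0.5]),   # ambiguous, falls below the threshold
]
pseudo_labels = select_pseudo_annotations(toy_boxes, toy_text)
```

In this toy run only the first box survives, labeled "cat"; the ambiguous second box is discarded rather than added to training as a noisy pseudo-annotation.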