Unlike incremental classification, incremental object detection suffers from data ambiguity: the same image may carry different labeled bounding boxes in different stages of continual learning. This ambiguity often impairs the model's ability to learn new classes. In addition, existing work gives little consideration to the forward compatibility of the model, which limits its suitability for incremental learning. To overcome these obstacles, we propose using a vision-language model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then use broad classes in place of the novel classes that are unavailable in the early learning stage, simulating the actual incremental scenario. Finally, we use the CLIP image encoder to identify potential objects among the proposals that the model classifies as background. We change the background labels of those proposals to known classes and add the boxes to the training set, which alleviates the data ambiguity. We evaluate our approach under various incremental learning settings on the PASCAL VOC 2007 dataset, where it outperforms state-of-the-art methods, particularly on the new classes.
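The relabeling step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each proposal already has an image embedding and each known class a text embedding (in practice both would come from CLIP's encoders), and that a proposal labeled as background is reassigned to its most similar class when the cosine similarity reaches a threshold. The names `relabel_background` and `BACKGROUND`, the threshold value, and the toy 3-dimensional vectors are all hypothetical.

```python
import math
from typing import List, Sequence

BACKGROUND = -1  # hypothetical label id standing for "background"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def relabel_background(
    proposal_embs: Sequence[Sequence[float]],
    labels: Sequence[int],
    class_text_embs: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[int]:
    """Reassign background proposals to the best-matching known class.

    Proposals that already carry a class label are left untouched.
    A background proposal keeps its label unless its highest similarity
    to a class text embedding is at least `threshold`.
    """
    new_labels = list(labels)
    for i, (emb, label) in enumerate(zip(proposal_embs, labels)):
        if label != BACKGROUND:
            continue
        sims = [cosine(emb, text_emb) for text_emb in class_text_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            new_labels[i] = best
    return new_labels


# Toy stand-ins for embeddings: two known classes, three proposals.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
proposal_embs = [[0.9, 0.1, 0.0], [0.1, 0.1, 0.9], [0.0, 2.0, 0.1]]
labels = [BACKGROUND, BACKGROUND, 1]

print(relabel_background(proposal_embs, labels, class_text_embs))
# → [0, -1, 1]
```

In this toy run the first proposal is close to class 0 and is relabeled, the second matches neither class well and stays background, and the third already has a label and is left alone. The relabeled boxes would then be added to the training set, as the abstract describes.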