Open-vocabulary object detection aims to enable object detectors trained on a fixed set of object categories to detect objects described by arbitrary text queries. Previous methods adopt knowledge distillation to extract knowledge from Pretrained Vision-and-Language Models (PVLMs) and transfer it to detectors. However, because they rely on non-adaptive proposal cropping and single-level feature mimicking, they suffer from information destruction during knowledge extraction and from inefficient knowledge transfer. To remedy these limitations, we propose an Object-Aware Distillation Pyramid (OADP) framework, which comprises an Object-Aware Knowledge Extraction (OAKE) module and a Distillation Pyramid (DP) mechanism. When extracting object knowledge from PVLMs, OAKE adaptively transforms object proposals and adopts object-aware mask attention to obtain precise and complete knowledge of objects. DP introduces global and block distillation alongside object distillation, transferring knowledge more comprehensively and compensating for the relation information that object distillation alone misses. Extensive experiments show that our method significantly outperforms current methods. In particular, on the MS-COCO dataset, our OADP framework reaches $35.6$ mAP$^{\text{N}}_{50}$, surpassing the current state-of-the-art method by $3.3$ mAP$^{\text{N}}_{50}$. Code is released at https://github.com/LutingWang/OADP.
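To make the idea of multi-level distillation concrete, the following is a minimal sketch of how losses at the global, block, and object levels could be combined. It is not taken from the released code: the abstract does not specify the loss form, so a plain L1 feature-mimicking loss is assumed here, and the names `l1_mimic_loss`, `pyramid_distillation_loss`, and the per-level `weights` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l1_mimic_loss(student: List[Vector], teacher: List[Vector]) -> float:
    """Mean absolute difference between paired student and teacher embeddings.

    Assumption: an L1 mimicking loss; the abstract does not state the loss used.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must hold the same number of embeddings")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("embedding dimensions must match")
        total += sum(abs(s - t) for s, t in zip(s_vec, t_vec))
        count += len(s_vec)
    return total / count if count else 0.0


def pyramid_distillation_loss(
    student: Dict[str, List[Vector]],
    teacher: Dict[str, List[Vector]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-level mimicking losses (hypothetical weighting scheme)."""
    return sum(
        weight * l1_mimic_loss(student[level], teacher[level])
        for level, weight in weights.items()
    )


# Toy embeddings: one whole-image vector, two block vectors, one object vector.
student = {
    "global": [[1.0, 2.0]],
    "block": [[0.0, 0.0], [1.0, 1.0]],
    "object": [[0.5, 0.5]],
}
teacher = {
    "global": [[1.0, 4.0]],
    "block": [[0.0, 1.0], [1.0, 1.0]],
    "object": [[0.5, 0.5]],
}
weights = {"global": 1.0, "block": 1.0, "object": 1.0}

loss = pyramid_distillation_loss(student, teacher, weights)
print(loss)
```

In this toy setting the teacher vectors stand in for PVLM embeddings of the whole image, of image blocks, and of object proposals, while the student vectors stand in for the detector's corresponding embeddings. A real implementation would operate on tensors and would likely tune the level weights.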