We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on the novel categories of the LVIS open-vocabulary detection benchmark. We also demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
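The inference-time combination of detector and VLM outputs can be illustrated with a minimal sketch. The abstract does not state the combination rule, so this example assumes a per-class weighted geometric mean of the two probability vectors, with separate weights for base (seen in detector training) and novel categories. The function name `combine_scores`, the weights `alpha` and `beta`, and their default values are hypothetical and chosen only for illustration.

```python
import numpy as np


def combine_scores(detector_probs, vlm_probs, is_base, alpha=0.35, beta=0.65):
    """Fuse per-region class probabilities from a detector head and a frozen VLM.

    ASSUMPTION: a weighted geometric mean is used for fusion; the rule and the
    default weights are illustrative, not taken from the text above.

    detector_probs: (R, C) array, detector-head probabilities per region.
    vlm_probs:      (R, C) array, frozen-VLM probabilities per region.
    is_base:        (C,) boolean array, True for base categories.
    alpha, beta:    VLM weights in [0, 1] for base and novel categories.
    """
    detector_probs = np.asarray(detector_probs, dtype=float)
    vlm_probs = np.asarray(vlm_probs, dtype=float)
    # Pick the VLM weight per class: alpha for base, beta for novel.
    weight = np.where(np.asarray(is_base, dtype=bool), alpha, beta)  # (C,)
    # Weighted geometric mean, broadcast over the region axis.
    return detector_probs ** (1.0 - weight) * vlm_probs ** weight


# Example: one region, one base class and one novel class.
scores = combine_scores(
    detector_probs=[[0.81, 0.01]],
    vlm_probs=[[0.81, 0.64]],
    is_base=[True, False],
    alpha=0.5,
    beta=1.0,
)
```

With `beta` close to 1, the fused score for a novel category relies almost entirely on the frozen VLM, which reflects the intuition that the detector head was never trained on those categories.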