We introduce MQ-Det (Multi-modal Queried object Detection), an efficient architecture and pre-training strategy that uses both textual descriptions, which offer open-set generalization, and visual exemplars, which offer rich description granularity, as category queries for real-world detection across open-vocabulary categories and varying granularity. MQ-Det incorporates vision queries into existing, well-established language-queried-only detectors. We propose a plug-and-play gated class-scalable perceiver module on top of the frozen detector to augment category text with class-wise visual information. To address the learning inertia problem brought by the frozen detector, we propose a vision-conditioned masked language prediction strategy. MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors, which makes it broadly applicable. Experimental results demonstrate that multi-modal queries substantially boost open-world detection. For instance, MQ-Det improves the state-of-the-art open-set detector GLIP by +7.8% zero-shot AP on the LVIS benchmark and by +6.3% AP on average across 13 few-shot downstream tasks, with only 3% of the pre-training time required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.
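To make the idea of augmenting category text with class-wise visual information more concrete, the following is a minimal sketch and not the authors' implementation. It assumes one text embedding per category, a small set of exemplar features per category, dot-product attention restricted to each category's own exemplars, and a scalar tanh gate initialized at zero. The function name `gated_vision_augment`, the gate form, and all shapes are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_vision_augment(text_emb, vision_queries, gate):
    """Augment per-category text embeddings with class-wise visual exemplars.

    Hypothetical sketch (assumed shapes and gate form, not from the paper):
      text_emb:       (C, D) one embedding per category from a frozen text encoder
      vision_queries: (C, K, D) K exemplar features for each of the C categories
      gate:           scalar; tanh(gate) scales the visual residual
    """
    dim = text_emb.shape[-1]
    # Each category's text embedding attends only to its own K exemplars,
    # so the cost grows linearly with the number of categories.
    scores = np.einsum("cd,ckd->ck", text_emb, vision_queries) / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    attended = np.einsum("ck,ckd->cd", weights, vision_queries)
    # Gated residual: with gate = 0 the text embeddings pass through unchanged,
    # so the frozen detector initially sees exactly its original inputs.
    return text_emb + np.tanh(gate) * attended


rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))          # 3 categories, embedding size 8
exemplars = rng.normal(size=(3, 5, 8))  # 5 visual exemplars per category

unchanged = gated_vision_augment(text, exemplars, gate=0.0)
augmented = gated_vision_augment(text, exemplars, gate=0.5)
```

In this sketch, a zero gate leaves the language-only queries untouched, which is one common way to attach a new module to a frozen model without disturbing it at the start of training. Because attention is computed per category, adding or removing a category does not change the augmented embeddings of the others.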