We study multi-modal few-shot object detection (FSOD), which uses both few-shot visual examples and class semantic information for detection; the two sources are complementary. Most previous work on multi-modal FSOD is fine-tuning-based, which is inefficient for online applications. Moreover, these methods usually require expert knowledge, such as class names, to extract class semantic embeddings, and this knowledge is hard to obtain for rare classes. Our approach is motivated by the high-level conceptual similarity between (metric-based) meta-learning and prompt-based learning, which learn generalizable few-shot and zero-shot object detection models, respectively, without fine-tuning. Specifically, we combine the few-shot visual classifier learned via meta-learning with the text classifier learned via prompt-based learning to build the multi-modal classifier and detection models. In addition, to fully exploit pre-trained language models, we propose meta-learning-based cross-modal prompting, which generates soft prompts for the novel classes present in the few-shot visual examples; these prompts are then used to learn the text classifier. We introduce knowledge distillation to learn the soft prompt generator without relying on human prior knowledge of class names, which may not be available for rare classes. Our insight is that the few-shot support images naturally include context information and semantics related to the class. We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks and achieve promising results.
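To make the idea of combining a few-shot visual classifier with a text classifier concrete, the following is a minimal sketch, not the method evaluated in this paper. It assumes a metric-based visual classifier that scores a query feature against class prototypes (the mean of the support features), a text classifier whose per-class weight vector stands in for an embedding obtained from soft prompts, and a simple weighted average of the two cosine scores. The function names, the fusion weight `alpha`, and the `temperature` are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (a class prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multimodal_scores(query, support_feats, text_embeds,
                      alpha=0.5, temperature=0.1):
    """Fuse visual-prototype and text-classifier scores for one query feature.

    support_feats: dict mapping class id -> list of support feature vectors.
    text_embeds:   dict mapping class id -> text classifier weight vector
                   (a placeholder for an embedding derived from soft prompts).
    alpha:         assumed fusion weight between the two modalities.
    """
    classes = sorted(support_feats)
    logits = []
    for c in classes:
        prototype = mean_vector(support_feats[c])   # few-shot visual classifier
        visual = cosine(query, prototype)
        textual = cosine(query, text_embeds[c])     # text classifier
        logits.append((alpha * visual + (1 - alpha) * textual) / temperature)
    return dict(zip(classes, softmax(logits)))


# Toy 2-D features; real features would come from a detection backbone.
support = {"a": [[1.0, 0.0], [0.9, 0.1]], "b": [[0.0, 1.0], [0.1, 0.9]]}
text = {"a": [1.0, 0.2], "b": [0.2, 1.0]}
probs = multimodal_scores([0.8, 0.2], support, text)
```

Because neither branch is updated with gradients on the novel classes in this sketch, adding a class only requires supplying its support features and text vector, which mirrors the fine-tuning-free setting described above.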