We study multi-modal few-shot object detection (FSOD), which uses both few-shot visual examples and class semantic information for detection; the two are complementary by definition. Most previous work on multi-modal FSOD is fine-tuning-based, which is inefficient for online applications. Moreover, these methods usually require human expertise, such as class names, to extract class semantic embeddings, and such information can be hard to obtain for rare classes. Our approach is motivated by the high-level conceptual similarity between (metric-based) meta-learning and prompt-based learning, which learn generalizable few-shot and zero-shot object detection models, respectively, without fine-tuning. Specifically, we combine a few-shot visual classifier learned via meta-learning with a text classifier learned via prompt-based learning to build the multi-modal classifier and detection models. In addition, to fully exploit pre-trained language models, we propose meta-learning-based cross-modal prompting, which generates soft prompts for the novel classes present in the few-shot visual examples; these prompts are then used to learn the text classifier. Knowledge distillation is introduced to learn the soft prompt generator without human prior knowledge of class names, which may not be available for rare classes. Our insight is that the few-shot support images naturally contain context information and semantics related to the class. We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks and achieve promising results.
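
To make the idea of combining a metric-based visual classifier with a text classifier concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes class prototypes are the mean of the support features, that both classifiers score a query by cosine similarity, and that the two sets of logits are fused by a weighted sum; the abstract does not specify the fusion rule. The names `multimodal_scores`, `alpha`, and `temperature` are hypothetical, and the text embeddings stand in for the output of a language model fed with generated soft prompts.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def class_prototypes(support_feats, support_labels, num_classes):
    """Average the support features of each class into one prototype.

    Assumes every class has at least one support example.
    """
    protos = []
    for c in range(num_classes):
        members = [f for f, y in zip(support_feats, support_labels) if y == c]
        protos.append([sum(col) / len(members) for col in zip(*members)])
    return protos


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multimodal_scores(query, support_feats, support_labels, text_embeds,
                      alpha=0.5, temperature=0.1):
    """Class probabilities for one query feature, using both modalities."""
    protos = class_prototypes(support_feats, support_labels, len(text_embeds))
    # Visual branch: similarity to prototypes built from the few-shot examples.
    visual = [cosine(query, p) / temperature for p in protos]
    # Text branch: similarity to per-class text embeddings (placeholders here).
    text = [cosine(query, t) / temperature for t in text_embeds]
    # ASSUMPTION: a weighted sum of logits; the actual fusion rule is not given.
    fused = [alpha * v + (1 - alpha) * t for v, t in zip(visual, text)]
    return softmax(fused)


# Toy 2-way 2-shot episode with hand-made 2-D features.
support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
print(multimodal_scores([1.0, 0.05], support_feats, support_labels, text_embeds))
```

Neither branch is updated by gradient descent when new classes arrive: only the prototypes and text embeddings change. This is the property that lets such a classifier handle novel classes without fine-tuning.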