Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and perform poorly in few-shot and zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. Specifically, we first introduce a novel interaction decoder that extracts informative regions from CLIP's visual feature map via a cross-attention mechanism; the extracted features are then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in the CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement that exploits global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin in various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.
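The abstract names visual semantic arithmetic without spelling it out. One plausible reading, offered here only as a minimal sketch, is that a verb representation is obtained by subtracting an object's feature from the feature of the whole interaction, and that candidates are then scored by cosine similarity. Everything below is an assumption made for illustration: the helper names (`verb_representation`, `cosine_scores`), the toy 3-dimensional vectors standing in for CLIP features, and the exact normalization order are hypothetical and are not taken from the HOICLIP code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def verb_representation(hoi_feats, obj_feats):
    """Assumed form of visual semantic arithmetic:
    verb ~ normalize(normalize(mean HOI feature) - normalize(mean object feature))."""
    h = l2_normalize(mean(hoi_feats))
    o = l2_normalize(mean(obj_feats))
    return l2_normalize([a - b for a, b in zip(h, o)])


def cosine_scores(query, class_embeddings):
    """Cosine similarity between a query feature and each named class embedding."""
    q = l2_normalize(query)
    return {
        name: sum(a * b for a, b in zip(q, l2_normalize(e)))
        for name, e in class_embeddings.items()
    }


# Toy stand-ins for region features (real CLIP features would be much larger).
hoi_ride_bike = [[1.0, 1.0, 0.0], [0.9, 1.1, 0.0]]  # "person riding a bike" regions
bike_only = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]      # "bike" regions

ride = verb_representation(hoi_ride_bike, bike_only)
verb_bank = {"ride": ride, "hold": [0.0, 0.0, 1.0]}  # "hold" is a made-up embedding

scores = cosine_scores([-0.3, 0.9, 0.1], verb_bank)
best_verb = max(scores, key=scores.get)
print(best_verb, scores)
```

In this reading, subtracting the object feature removes what is shared across all interactions with that object, so the remainder is dominated by the action itself. The lightweight adapter mentioned in the abstract would then refine such representations; it is not modeled in this sketch.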