Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. In addition, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Building on these observations, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP to align interactive semantics between visual and textual features. We first extract the global context of the image and the local features of objects to Improve interaction Features (IF) in images. We then propose a Verb Semantic Improvement (VSI) module to enhance the textual features of verb labels via cross-modal fusion. Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms state-of-the-art methods under zero-shot settings.