Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks. One generalizable and scalable strategy for HOI detection is weak supervision, which learns from image-level annotations only. This setting is inherently challenging because human-object associations are ambiguous, the search space of candidate HOIs is large, and the training signal is highly noisy. A promising way to address these challenges is to exploit knowledge from large-scale pretrained models (e.g., CLIP), but a direct knowledge distillation strategy~\citep{liao2022gen} does not perform well in the weakly supervised setting. In contrast, we develop a CLIP-guided HOI representation that incorporates prior knowledge at both the image level and the HOI instance level, and we adopt a self-taught mechanism to prune incorrect human-object associations. Experimental results on HICO-DET and V-COCO show that our method outperforms previous work by a sizable margin, demonstrating the efficacy of our HOI representation.