Recent human-object interaction (HOI) detection methods depend on extensively annotated image datasets, which demand substantial manual labor. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA. It leverages the adaptability of a text-image model to generate latent HOI labels without manual annotation. Specifically, FreeA aligns image features of human-object pairs with HOI text templates and employs a knowledge-based masking technique to suppress improbable interactions. Furthermore, FreeA incorporates an interaction correlation matching method that raises the likelihood of actions related to a given action, thereby refining the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI competitors. Our approach surpasses the newest ``Weakly'' supervised model by +\textbf{13.29} (\textbf{159\%$\uparrow$}) mAP and +\textbf{17.30} (\textbf{98\%$\uparrow$}) mAP, and the latest ``Weakly+'' supervised model by +\textbf{7.19} (\textbf{28\%$\uparrow$}) mAP and +\textbf{14.69} (\textbf{34\%$\uparrow$}) mAP, on the HICO-DET and V-COCO datasets respectively, localizing and classifying interactive actions more accurately. The source code will be made public.
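To make the label-generation idea concrete, the sketch below (ours, not the authors' released code) scores CLIP similarities between a cropped human-object region and a set of HOI text templates, then applies a hand-written plausibility mask; the model name, prompts, file path, verb set, and mask values are illustrative assumptions only.

\begin{verbatim}
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Candidate HOI text templates for a detected "horse" object; the verb
# vocabulary and prompt wording are illustrative placeholders.
verbs = ["riding", "holding", "feeding"]
templates = [f"a photo of a person {v} a horse" for v in verbs]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical crop around one human-object pair (their union region).
union_crop = Image.open("union_region.jpg")

inputs = processor(text=templates, images=union_crop,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image.squeeze(0)  # one score per template

# Knowledge-based mask: down-weight verbs judged implausible for this object.
# Hand-written here; FreeA derives such priors rather than asking a human.
plausible = torch.tensor([1.0, 1.0, 1.0])
probs = logits.softmax(dim=-1) * plausible
probs = probs / probs.sum()

for v, p in zip(verbs, probs.tolist()):
    print(f"person {v} horse: {p:.3f}")
\end{verbatim}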