Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image. Building HOI detection datasets inherently requires more labeling effort than many other computer vision tasks, yet weakly-supervised directions in this area remain underexplored. The reason is the difficulty of learning human-object interactions with weak supervision, which is rooted in the combinatorial nature of interactions over the object and predicate space. In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels, with the help of a pretrained vision-language model (VLM) and a large language model (LLM). First, we propose an approach that exploits the grounding capability of the VLM to prune non-interacting human and object proposals, increasing the quality of positive pairs within the bag. Second, we query the LLM for which interactions are possible between a human and a given object category, so that the model does not put emphasis on unlikely interactions. Lastly, we use an auxiliary weakly-supervised preposition prediction task to make our model explicitly reason about space. Extensive experiments and ablations show that each of our contributions increases HOI detection performance.
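To make the second idea concrete, the following is a minimal sketch of how LLM-derived plausibility could restrict image-level training, not the implementation used in this work. It assumes a simplified setting in which each candidate human-object pair carries one logit per verb, image-level scores are obtained by max-pooling over the bag of pairs, and the LLM answers have already been collected into a lookup table. The names `image_level_scores`, `bce_loss`, and `plausible` are hypothetical and introduced only for illustration.

```python
import math


def image_level_scores(pair_logits, pair_objects, plausible):
    """Max-pool pair-level verb probabilities into image-level scores.

    pair_logits:  list of per-pair lists, one logit per verb.
    pair_objects: object category name of each pair.
    plausible:    dict mapping object category -> set of verb indices that
                  are considered possible (hypothetical stand-in for the
                  answers obtained from an LLM).
    """
    num_verbs = len(pair_logits[0])
    scores = [0.0] * num_verbs
    for logits, obj in zip(pair_logits, pair_objects):
        allowed = plausible.get(obj, set())
        for k, z in enumerate(logits):
            if k not in allowed:
                # Implausible (object, verb) combinations contribute nothing.
                continue
            prob = 1.0 / (1.0 + math.exp(-z))
            scores[k] = max(scores[k], prob)
    return scores


def bce_loss(scores, labels, eps=1e-7):
    """Binary cross-entropy between image-level scores and image-level labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
    return total / len(scores)


# Toy example: verbs are [ride, eat]; the table says a bicycle can be
# ridden but not eaten, and an apple can be eaten but not ridden.
plausible = {"bicycle": {0}, "apple": {1}}
pair_logits = [[2.0, 2.0], [0.0, -1.0]]
pair_objects = ["bicycle", "apple"]
scores = image_level_scores(pair_logits, pair_objects, plausible)
loss = bce_loss(scores, labels=[1, 0])
```

In this toy case the bicycle pair has a high logit for "eat", but the mask keeps it from contributing to the image-level "eat" score, so only the apple pair is considered for that verb.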