Human-Object Interaction (HOI) detection, which localizes humans and objects and infers the relationships between them, plays an important role in scene understanding. Although two-stage HOI detectors are efficient in both training and inference, they perform worse than one-stage methods because they rely on outdated backbone networks and because their interaction classifiers do not account for how humans perceive HOIs. In this paper, we propose the Vision Transformer based Pose-Conditioned Self-Loop Graph (ViPLO) to resolve these problems. First, we propose a novel feature extraction method suited to the Vision Transformer backbone, called the masking with overlapped area (MOA) module. The MOA module uses the overlapped area between each patch and the given region in the attention function, which addresses the quantization problem that arises when using the Vision Transformer backbone. In addition, we design a graph with a pose-conditioned self-loop structure, which updates the human node encoding with local features of human joints. Motivated by the human perception process for HOI, this allows the classifier to focus on specific human joints to effectively identify the type of interaction. As a result, ViPLO achieves state-of-the-art results on two public benchmarks, notably obtaining a +2.07 mAP performance gain on the HICO-DET dataset. The source code is available at https://github.com/Jeeseung-Park/ViPLO.
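To make the idea behind the MOA module concrete, the following is a minimal sketch of one way to feed patch-region overlap into an attention function. It is not taken from the ViPLO repository. The function names, the 224-pixel image, the 16-pixel patches, the choice of the log of the overlap fraction as an additive attention bias, and the `eps` constant are all assumptions made for illustration. The point it shows is that a patch only partly covered by a region contributes in proportion to its overlap, rather than being rounded to fully inside or fully outside, which is the quantization issue described above.

```python
import numpy as np

def patch_overlap_fraction(box, image_size=224, patch_size=16):
    """Fraction of each patch's area covered by box = (x1, y1, x2, y2) in pixels.

    Hypothetical helper: returns an array of shape (n, n), n = image_size // patch_size.
    """
    n = image_size // patch_size
    edges = np.arange(n + 1) * patch_size  # patch boundaries along one axis
    x1, y1, x2, y2 = box
    # 1-D overlap length between the box and each patch column / row.
    ox = np.clip(np.minimum(edges[1:], x2) - np.maximum(edges[:-1], x1), 0, None)
    oy = np.clip(np.minimum(edges[1:], y2) - np.maximum(edges[:-1], y1), 0, None)
    # The outer product gives the 2-D overlap area for every (row, col) patch.
    area = np.outer(oy, ox)
    return area / float(patch_size ** 2)

def overlap_attention_bias(box, image_size=224, patch_size=16, eps=1e-6):
    """Additive bias for attention logits (assumed form: log of overlap fraction)."""
    frac = patch_overlap_fraction(box, image_size, patch_size).reshape(-1)
    # eps keeps the log finite for patches with no overlap.
    return np.log(frac + eps)

def biased_attention_weights(logits, bias):
    """Softmax over patch tokens after adding the overlap bias to the logits."""
    z = logits + bias
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Usage: a region that covers some patches only partially.
box = (8, 8, 40, 24)
frac = patch_overlap_fraction(box)
bias = overlap_attention_bias(box)
weights = biased_attention_weights(np.zeros(bias.shape[0]), bias)
```

With uniform logits, the resulting weights are proportional to each patch's overlap fraction, so half-covered patches receive twice the weight of quarter-covered ones and patches outside the region receive almost none. A hard binary mask would instead have to round each of those patches to fully inside or fully outside.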