Human-Object Interaction (HOI) detection plays a vital role in scene understanding; it aims to predict HOI triplets of the form <human, object, action>. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and fuse them to directly predict HOI triplets. However, most of these methods focus on self-triplet aggregation and ignore potential cross-triplet dependencies, which leads to ambiguous action predictions. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, to aggregate self-triplet correlations, we regard each triplet proposal as a graph in which the human and the object are nodes and the action is the edge. We also explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. In addition, we leverage the CLIP model to help SCTC obtain interaction-aware features through knowledge distillation, which provides useful action cues for HOI detection. Extensive experiments on the HICO-DET and V-COCO datasets verify the effectiveness of the proposed SCTC.
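As a rough illustration of two ideas named in the abstract, the sketch below represents one triplet proposal as a two-node graph (human and object as nodes, action as the edge) and computes a simple distillation loss against a teacher feature. It is a minimal sketch under stated assumptions, not the SCTC implementation: the names `TripletGraph`, `aggregate_self_triplet`, and `cosine_distillation_loss` are hypothetical, the aggregation rule and the cosine loss are placeholders chosen for clarity, and the teacher vector merely stands in for a CLIP embedding.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class TripletGraph:
    """One HOI triplet proposal viewed as a two-node graph (hypothetical type)."""
    human: Vector   # node feature of the detected human
    obj: Vector     # node feature of the detected object
    action: Vector  # edge feature of the candidate interaction


def aggregate_self_triplet(t: TripletGraph) -> Vector:
    """Update the edge feature with the mean of its two endpoint nodes.

    Assumption: this additive rule is only a placeholder for whatever
    self-triplet aggregation the method actually uses.
    """
    return [a + 0.5 * (h + o) for a, h, o in zip(t.action, t.human, t.obj)]


def cosine_distillation_loss(student: Vector, teacher: Vector) -> float:
    """Return 1 - cosine similarity between student and teacher features.

    Assumption: both vectors are non-zero and have the same length; the
    teacher vector stands in for a feature produced by a frozen CLIP model.
    """
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)


# Toy 2-dimensional features, purely illustrative.
proposal = TripletGraph(human=[1.0, 0.0], obj=[0.0, 1.0], action=[0.0, 0.0])
interaction_feature = aggregate_self_triplet(proposal)
loss = cosine_distillation_loss(interaction_feature, teacher=[1.0, 1.0])
```

In this toy case the aggregated edge feature points in the same direction as the teacher vector, so the loss is close to zero; orthogonal features would give a loss of 1. Cross-triplet dependencies (instance-level, semantic-level, and layout-level relations) are not modeled here.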