Human-object interaction (HOI) detection is a crucial aspect of human-centric scene understanding, with important applications in various domains. Despite recent progress in this field, recognizing subtle and detailed interactions remains challenging. Existing methods use human-related cues to alleviate the difficulty, but they rely heavily on external annotations or knowledge, which limits their applicability in real-world scenarios. In this work, we propose a novel Part Semantic Network (PSN) to address this problem. The core of PSN is a Conditional Part Attention (CPA) mechanism, a cross-attention module in which human features serve as keys and values and the object feature serves as the query. In this way, our model learns to focus automatically on the most informative human parts conditioned on the involved object, generating more semantically meaningful features for interaction recognition. Additionally, we propose an Occluded Part Extrapolation (OPE) strategy to facilitate interaction recognition under occlusion, which teaches the model to extrapolate detailed features from partially occluded ones. Our method consistently outperforms prior approaches on the V-COCO and HICO-DET datasets without external data or extra annotations. Ablation studies validate the effectiveness of each component of the proposed method.
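To make the query/key/value roles in CPA concrete, the following is a minimal single-head sketch in NumPy of cross-attention with one object feature as the query and a set of human part features as keys and values. It is an illustration under stated assumptions, not the PSN implementation: the function name `conditional_part_attention`, the number of parts, the feature dimensions, the scaled dot-product scoring, and the randomly initialized projection matrices are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def conditional_part_attention(object_feat, part_feats, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical sketch, not the paper's code).

    object_feat: (d_model,)            object feature, used as the query
    part_feats:  (num_parts, d_model)  human part features, used as keys/values
    w_q, w_k:    (d_model, d_k)        query/key projections (assumed)
    w_v:         (d_model, d_v)        value projection (assumed)
    Returns the attended human feature (d_v,) and the part weights (num_parts,).
    """
    q = object_feat @ w_q                      # (d_k,)
    k = part_feats @ w_k                       # (num_parts, d_k)
    v = part_feats @ w_v                       # (num_parts, d_v)
    scores = (k @ q) / np.sqrt(q.shape[-1])    # one score per human part
    weights = softmax(scores)                  # attention over parts, sums to 1
    attended = weights @ v                     # object-conditioned human feature
    return attended, weights


# Toy inputs: sizes and random values are placeholders, not from the paper.
rng = np.random.default_rng(0)
num_parts, d_model, d_k, d_v = 6, 16, 8, 8
object_feat = rng.normal(size=d_model)
part_feats = rng.normal(size=(num_parts, d_model))
w_q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
w_k = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
w_v = rng.normal(size=(d_model, d_v)) / np.sqrt(d_model)

attended, weights = conditional_part_attention(
    object_feat, part_feats, w_q, w_k, w_v
)
print("part attention weights:", np.round(weights, 3))
print("attended feature shape:", attended.shape)
```

Because the query comes from the object, a different object feature yields a different weighting over the same set of human parts, which is the sense in which the attention is "conditioned on the involved object".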