Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still struggle with incomplete context modeling. In this work, we introduce Contextualized Representation Learning (CRL), which integrates affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We extend the conventional HOI detection framework beyond simple human-object pairs to multivariate relationships involving auxiliary entities such as tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through <human, tool, object> triplet structures, enabling the model to identify tool-dependent interactions such as 'filling'. Furthermore, a learnable prompt is enriched with instance categories and then fused with contextual visual features through an attention mechanism, aligning language with image content at both the global and regional levels. These contextualized representations provide the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our method achieves superior performance on both the HICO-Det and V-COCO datasets in most scenarios. The source code is available at https://github.com/lzzhhh1019/CRL.
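As a rough illustration of the prompt-visual fusion step described above, the following PyTorch-style sketch shows one way a category-enriched learnable prompt could be aligned with global and regional visual features via cross-attention. It is not the authors' implementation: the module name, dimensions, number of prompt tokens, and the category-embedding scheme are all assumptions made for clarity.

```python
# Minimal sketch (assumptions throughout): fuse a category-enriched learnable
# prompt with global and regional visual features using cross-attention.
import torch
import torch.nn as nn


class PromptVisualFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_prompt_tokens=8, num_classes=80):
        super().__init__()
        # Learnable context tokens, shared across images.
        self.prompt_tokens = nn.Parameter(torch.randn(num_prompt_tokens, dim))
        # Embedding that injects the detected instance category into the prompt.
        self.class_embed = nn.Embedding(num_classes, dim)
        # Cross-attention: prompt tokens query the visual features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, class_ids, global_feat, region_feats):
        """
        class_ids:    (B,)       detected object category per instance
        global_feat:  (B, 1, D)  image-level visual feature
        region_feats: (B, R, D)  region-level features (e.g., human/object/tool boxes)
        returns:      (B, T, D)  contextualized prompt representation
        """
        B = class_ids.shape[0]
        # Enrich the shared prompt with the instance category.
        prompt = self.prompt_tokens.unsqueeze(0).expand(B, -1, -1)
        prompt = prompt + self.class_embed(class_ids).unsqueeze(1)
        # Attend over both global and regional visual cues.
        visual = torch.cat([global_feat, region_feats], dim=1)
        fused, _ = self.cross_attn(query=prompt, key=visual, value=visual)
        return self.norm(prompt + fused)
```

Concatenating the image-level feature with the region features lets a single attention pass align the prompt at both the global and regional levels, which mirrors the two-level alignment the abstract describes.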