Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.