Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these detectors often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.