Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection

Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks that aim to localise and recognise relationships between objects and interactions between humans and objects, respectively. Prevailing works treat them as distinct tasks, leading to task-specific models tailored to individual datasets. However, we posit that visual relationships can furnish crucial contextual and relational cues that significantly improve the inference of human-object interactions. This motivates us to ask whether there is a natural intrinsic relationship between the two tasks, in which scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to unify the tasks of SGG and HOI detection. Concretely, we first use a relation Transformer to generate relation triples from a suite of visual features. Subsequently, we employ another Transformer-based decoder to predict human-object interactions based on the generated relation triples. Comprehensive experiments on established benchmark datasets, including Visual Genome, V-COCO, and HICO-DET, demonstrate the compelling performance of SG2HOI+ in comparison to prevalent one-stage SGG models. Remarkably, our approach also achieves competitive performance compared to state-of-the-art HOI methods. Additionally, we observe that SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements for both tasks compared to training on each task individually.
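
To make the two-step data flow described above concrete, the following is a minimal PyTorch sketch, not the SG2HOI+ implementation. It assumes a relation decoder that attends to visual features with learned queries, followed by an HOI decoder that attends to the resulting relation embeddings. The class name, query counts, layer sizes, and class counts are all hypothetical, and bounding-box heads, the backbone, and training losses are omitted.

```python
import torch
import torch.nn as nn


class RelationThenHOISketch(nn.Module):
    """Illustrative only: a relation decoder feeding an HOI decoder."""

    def __init__(self, d_model=64, n_heads=4, n_rel_queries=10,
                 n_hoi_queries=5, n_obj_classes=20, n_predicates=8,
                 n_actions=6):
        super().__init__()
        # Learned queries (hypothetical counts), one set per decoder.
        self.rel_queries = nn.Parameter(torch.randn(n_rel_queries, d_model))
        self.hoi_queries = nn.Parameter(torch.randn(n_hoi_queries, d_model))

        rel_layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        hoi_layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.rel_decoder = nn.TransformerDecoder(rel_layer, num_layers=2)
        self.hoi_decoder = nn.TransformerDecoder(hoi_layer, num_layers=2)

        # Classification heads for <subject, predicate, object> triples.
        self.subject_head = nn.Linear(d_model, n_obj_classes)
        self.predicate_head = nn.Linear(d_model, n_predicates)
        self.object_head = nn.Linear(d_model, n_obj_classes)
        # Classification head for human-object interaction (action) labels.
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, visual_feats):
        # visual_feats: (batch, num_tokens, d_model), e.g. flattened
        # backbone features (assumed to be computed elsewhere).
        batch = visual_feats.size(0)

        # Step 1: relation queries attend to the visual features.
        rel_q = self.rel_queries.unsqueeze(0).expand(batch, -1, -1)
        rel_emb = self.rel_decoder(tgt=rel_q, memory=visual_feats)

        # Step 2: HOI queries attend to the relation embeddings, so
        # interactions are inferred from the predicted relations.
        hoi_q = self.hoi_queries.unsqueeze(0).expand(batch, -1, -1)
        hoi_emb = self.hoi_decoder(tgt=hoi_q, memory=rel_emb)

        return {
            "subject_logits": self.subject_head(rel_emb),
            "predicate_logits": self.predicate_head(rel_emb),
            "object_logits": self.object_head(rel_emb),
            "action_logits": self.action_head(hoi_emb),
        }
```

Because both decoders sit in a single module, a loss on the triple logits and a loss on the action logits could be summed and backpropagated together, which is one simple way to realise the end-to-end joint training mentioned above. The actual loss design is not specified here.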
