Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
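To make the pair-selection idea concrete, the following is a minimal sketch of how object tokens could be scored pairwise so that only the most promising subject-object pairs are kept. It is an illustration under stated assumptions, not the model described above: the function name `select_pairs`, the projection matrices `w_subject` and `w_object`, the plain dot-product score, and the toy values are all hypothetical, and the actual attention mechanism may differ.

```python
from itertools import product


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def select_pairs(tokens, w_subject, w_object, top_k):
    """Score every ordered (subject, object) token pair and keep the top_k.

    Assumption: each object token is projected into a "subject" view and an
    "object" view, and the pair score is the dot product of the two views.
    """
    subjects = [matvec(w_subject, t) for t in tokens]
    objects = [matvec(w_object, t) for t in tokens]
    scored = [
        (dot(subjects[i], objects[j]), i, j)
        for i, j in product(range(len(tokens)), repeat=2)
        if i != j  # skip pairing a token with itself
    ]
    # Highest-scoring pairs first; only these would be passed on for
    # relationship classification.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(i, j, score) for score, i, j in scored[:top_k]]


# Toy example: three 2-dimensional object tokens (hypothetical values).
tokens = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
w_subject = [[1.0, 0.0], [0.0, 1.0]]
w_object = [[1.0, 1.0], [0.0, 1.0]]
pairs = select_pairs(tokens, w_subject, w_object, top_k=2)
print(pairs)
```

Scoring all N² ordered pairs and keeping only the top few keeps the later relationship-classification cost bounded, which is one plausible way a decoder-free design could stay efficient.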