Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

Human-object interaction (HOI) detection aims to comprehend the intricate relationships between humans and objects by predicting $\langle$human, action, object$\rangle$ triplets, and serves as the foundation for numerous computer vision tasks. The complexity and diversity of human-object interactions in the real world, however, pose significant challenges for both annotation and recognition, particularly when recognizing interactions in an open-world context. This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). We call the proposed method \emph{\textbf{UniHOI}}. We conduct an in-depth analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for extracting high-level relations from VL foundation models, which we call HO prompt-based learning. Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with the various HO pairs within the image. Furthermore, we utilize an LLM (\emph{i.e.}, GPT) for interaction interpretation, generating a richer linguistic understanding of complex HOIs. For open-category interaction recognition, our method supports either of two input types: an interaction phrase or an interpretive sentence. Our efficient architecture design and learning methods effectively unleash the potential of VL foundation models and LLMs, allowing UniHOI to surpass all existing methods by a substantial margin under both supervised and zero-shot settings. The code and pre-trained weights are available at: \url{https://github.com/Caoyichao/UniHOI}.

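To make the open-category recognition step more concrete, the following is a minimal sketch of how one human-object pair feature might be scored against text embeddings of candidate interaction phrases. It is not taken from the UniHOI code base. The function names, the temperature value, and the three-dimensional toy vectors are all assumptions made for illustration; in a real system the embeddings would presumably come from the text and visual branches of a VL foundation model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.07):
    """Temperature-scaled softmax; the temperature is an assumed value."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_interactions(pair_feature, text_embeddings):
    """Rank candidate interaction phrases for one human-object pair.

    pair_feature: hypothetical relation feature for a single HO pair.
    text_embeddings: dict mapping a phrase (or interpretive sentence)
        to its hypothetical text embedding.
    Returns (phrase, probability) tuples, most likely first.
    """
    phrases = list(text_embeddings)
    sims = [cosine(pair_feature, text_embeddings[p]) for p in phrases]
    probs = softmax(sims)
    return sorted(zip(phrases, probs), key=lambda item: item[1], reverse=True)


# Toy embeddings, chosen by hand purely for illustration.
toy_text_embeddings = {
    "ride bicycle": [1.0, 0.0, 0.0],
    "hold bicycle": [0.0, 1.0, 0.0],
    "repair bicycle": [0.0, 0.0, 1.0],
}
toy_pair_feature = [0.9, 0.1, 0.0]

ranking = rank_interactions(toy_pair_feature, toy_text_embeddings)
for phrase, prob in ranking:
    print(f"{phrase}: {prob:.4f}")
```

Because the candidate set is just a dictionary of phrases, new interaction categories can be added at inference time by supplying additional text embeddings, which is the general idea behind zero-shot matching of this kind.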