Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interactions between humans and objects. Current HOI models rely on a predefined vocabulary of interactions at both training and inference time, which restricts them to static environments with a fixed set of interactions. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI setting that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs in this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi