Human-object interaction (HOI) detection plays a key role in high-level visual understanding and supports a deeper comprehension of human activities. Specifically, HOI detection aims to localize the humans and objects involved in interactions within images or videos and to classify the interactions between them. Performance on this task depends on several factors: accurate localization of human and object instances, and correct classification of object categories and interaction relationships. This paper systematically summarizes and discusses recent work in image-based HOI detection. First, the mainstream datasets used in HOI detection are introduced. Next, the paper reviews current developments in image-based HOI detection, covering two-stage methods and end-to-end one-stage methods, and analyzes the strengths and weaknesses of both paradigms. It then discusses advances in zero-shot learning, weakly supervised learning, and the application of large-scale language models to HOI detection. Finally, the current challenges in HOI detection are outlined, and potential research directions and future trends are explored.