Human-Object Interaction (HOI) detection is a challenging computer vision task that requires visual models to reason about the complex interactive relationships between humans and objects and to predict HOI triplets. Although the large number of possible interaction combinations makes the task difficult, it also creates opportunities for multimodal learning across vision and text. In this paper, we present a systematic and unified framework (RmLR) that enhances HOI detection by incorporating structured text knowledge. First, we qualitatively and quantitatively analyze the loss of interaction information in the two-stage HOI detector and propose a re-mining strategy to generate more comprehensive visual representations. Second, we design finer-grained sentence- and word-level alignment and knowledge transfer strategies to address the many-to-many matching problem between multiple interactions and multiple texts. These strategies alleviate the matching confusion that arises when multiple interactions occur simultaneously, thereby improving the effectiveness of the alignment process. Finally, HOI reasoning with visual features augmented by textual knowledge substantially improves the understanding of interactions. Experimental results demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on public benchmarks. We further analyze the effects of the individual components of our approach to provide insights into its efficacy.