Human-Object Interaction (HOI) detection is a challenging computer vision task that requires visual models to capture the complex interactive relationships between humans and objects and to predict HOI triplets. Although the large number of interaction combinations makes the task difficult, it also creates opportunities for multimodal learning across vision and text. In this paper, we present a systematic and unified framework (RmLR) that enhances HOI detection by incorporating structured text knowledge. First, we qualitatively and quantitatively analyze the loss of interaction information in the two-stage HOI detector and propose a re-mining strategy to generate more comprehensive visual representations. Second, we design finer-grained sentence- and word-level alignment and knowledge transfer strategies to address the many-to-many matching problem between multiple interactions and multiple texts. These strategies alleviate the matching confusion that arises when multiple interactions occur simultaneously, thereby improving the effectiveness of the alignment process. Finally, HOI reasoning over visual features augmented with textual knowledge substantially improves the understanding of interactions. Experimental results demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on public benchmarks. We further analyze the effects of the individual components of our approach to provide insights into its efficacy.
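To make the many-to-many matching problem concrete, the following is a minimal illustrative sketch, not the RmLR implementation. It assumes that interaction features and text embeddings live in a shared vector space, scores them by cosine similarity, and lets each interaction match every text above a threshold. All function names, the max-over-words pooling, the toy vectors, and the threshold value are hypothetical choices made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_level_scores(interactions, sentences):
    """Score every interaction feature against every sentence embedding."""
    return [[cosine(q, s) for s in sentences] for q in interactions]


def word_level_scores(interactions, sentence_words):
    """Score each interaction against each sentence via its best-matching word.

    Assumption: max pooling over word embeddings; other poolings are possible.
    """
    return [
        [max(cosine(q, w) for w in words) for words in sentence_words]
        for q in interactions
    ]


def many_to_many_matches(scores, threshold):
    """Return all (interaction_index, text_index) pairs scoring above threshold."""
    return [
        (i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
        if score > threshold
    ]


# Toy data (hypothetical): two interaction features, three sentence embeddings.
interactions = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Word embeddings for two sentences: the first has two words, the second has one.
sentence_words = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]

matches = many_to_many_matches(sentence_level_scores(interactions, sentences), 0.5)
word_scores = word_level_scores(interactions, sentence_words)
```

In this toy setting the third sentence is matched by both interactions, and each interaction matches two sentences, so no one-to-one assignment describes the result. This is the kind of ambiguity that a sentence-level score alone leaves unresolved and that word-level scores can help separate.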