Human-centered dynamic scene understanding plays a pivotal role in enhancing the capabilities of robotic and autonomous systems. Within it, Video-based Human-Object Interaction (V-HOI) detection is a crucial semantic scene understanding task that aims to comprehensively capture HOI relationships within a video, thereby informing the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant strides in detection accuracy on specific datasets, they still lack the human-like general reasoning ability needed to infer HOI relationships effectively. In this study, we propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that can improve the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a Cross-Agents Reasoning scheme that has an LLM reason from different aspects. In the second stage, we perform Multi-LLMs Debate to obtain the final reasoning answer based on the different knowledge held by different LLMs. Additionally, we devise an auxiliary training strategy that utilizes CLIP, a large vision-language model, to enhance the base V-HOI models' discriminative ability so that they cooperate better with LLMs. We validate our design by demonstrating its effectiveness in improving the prediction accuracy of the base V-HOI model via reasoning from multiple perspectives.
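To make the two-stage collaboration concrete, the following is a minimal sketch of how Cross-Agents Reasoning and Multi-LLMs Debate could be wired together. It is not the implementation behind V-HOI MLCR: the abstract does not specify prompts, aspects, or how answers are aggregated, so the function names (`cross_agent_reasoning`, `multi_llm_debate`), the prompt wording, the majority vote, and the fixed number of debate rounds are all assumptions made for illustration. Each LLM is modeled as a plain callable from a prompt string to a reply string, and the candidate list stands in for the base V-HOI model's top predictions. The CLIP-based auxiliary training is not covered here.

```python
from collections import Counter
from typing import Callable, Dict, List

# Assumption: an "LLM" is any callable that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def cross_agent_reasoning(llm: LLM, candidates: List[str], context: str,
                          aspects: List[str]) -> str:
    """Stage 1 (hypothetical): query one LLM once per aspect, then majority-vote."""
    votes: Counter = Counter()
    for aspect in aspects:
        prompt = (f"Aspect: {aspect}\nScene: {context}\n"
                  f"Candidates: {', '.join(candidates)}\n"
                  "Answer with exactly one candidate.")
        reply = llm(prompt).strip()
        if reply in candidates:  # ignore replies that are not valid candidates
            votes[reply] += 1
    # Fall back to the first candidate (the base model's top guess) if nothing is valid.
    return votes.most_common(1)[0][0] if votes else candidates[0]


def multi_llm_debate(llms: Dict[str, LLM], candidates: List[str], context: str,
                     aspects: List[str], rounds: int = 2) -> str:
    """Stage 2 (hypothetical): LLMs see each other's answers and may revise them."""
    answers = {name: cross_agent_reasoning(llm, candidates, context, aspects)
               for name, llm in llms.items()}
    for _ in range(rounds):
        if len(set(answers.values())) == 1:  # consensus reached, stop early
            break
        revised = {}
        for name, llm in llms.items():
            others = "; ".join(f"{n}: {a}" for n, a in answers.items() if n != name)
            prompt = (f"Scene: {context}\nYour answer: {answers[name]}\n"
                      f"Other answers: {others}\n"
                      f"Candidates: {', '.join(candidates)}\n"
                      "Answer with exactly one candidate.")
            reply = llm(prompt).strip()
            # Keep the previous answer when the reply is not a valid candidate.
            revised[name] = reply if reply in candidates else answers[name]
        answers = revised
    # Final decision: majority vote over the LLMs' last answers.
    return Counter(answers.values()).most_common(1)[0][0]
```

In practice each callable would wrap a real LLM client, and the majority vote could be replaced by whatever aggregation rule the debate is meant to use; the sketch only fixes the control flow of "reason per aspect, then reconcile across models".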