Multimodal Large Language Models (MLLMs) have shown promise in bridging visual and textual reasoning, yet their performance on Open-Vocabulary Human-Object Interaction (OV-HOI) detection is limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose \textbf{ImagineAgent}, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. It then dynamically invokes tools, including retrieval augmentation, image cropping, and diffusion models, to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy against tool-invocation cost. Evaluations on the SWIG-HOI and HICO-DET datasets show that our method achieves state-of-the-art performance while requiring only about 20\% of the training data used by existing methods, validating its robustness and efficiency.
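To make the composite reward concrete, a minimal sketch is given below, assuming an additive accuracy--cost trade-off; the accuracy term $R_{\text{acc}}$, the per-tool cost $c_t$, and the weight $\lambda$ are illustrative placeholders rather than the exact formulation:
\[
R \;=\; R_{\text{acc}} \;-\; \lambda \sum_{t \in \mathcal{T}} c_t,
\]
where $\mathcal{T}$ denotes the set of tools invoked during an episode, so that correct interaction predictions are rewarded while each additional tool call (retrieval, cropping, or diffusion-based imagination) incurs a penalty scaled by $\lambda$.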