Accurate visual understanding is imperative for advancing autonomous systems and intelligent robots. Despite the powerful capabilities of vision-language models (VLMs) in processing complex visual scenes, precisely recognizing occluded or ambiguously presented visual elements remains challenging. To tackle these issues, this paper proposes InsightSee, a multi-agent framework that enhances VLMs' interpretative capabilities in complex visual understanding scenarios. The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the interpretation of visual information. We present the design of these agents and the mechanisms by which they enhance visual information processing. Experimental results demonstrate that the InsightSee framework not only boosts performance on specific visual tasks but also retains the original models' strengths. The proposed framework outperforms state-of-the-art algorithms on 6 of 9 benchmarks, marking a substantial advance in multimodal understanding.
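The agent composition described above can be sketched as a simple pipeline: a description agent summarizes the visual input, two reasoning agents independently answer from that description, and a decision agent reconciles their candidate answers. This is a minimal illustrative sketch only; the class name, callables, and aggregation rule are assumptions, since the abstract specifies the agent roles but not InsightSee's actual prompts, models, or decision mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List

# Each "agent" is modeled as a plain string-to-string callable; in the real
# framework these would be VLM/LLM calls (not specified in the abstract).
Agent = Callable[[str], str]

@dataclass
class InsightSeePipeline:
    describe: Agent                      # description agent: turns the visual input into a description
    reasoners: List[Agent]               # two reasoning agents: answer independently from the description
    decide: Callable[[List[str]], str]   # decision agent: reconciles the candidate answers

    def run(self, visual_query: str) -> str:
        description = self.describe(visual_query)
        candidates = [reason(description) for reason in self.reasoners]
        return self.decide(candidates)

# Toy usage with stand-in functions instead of real model calls.
pipeline = InsightSeePipeline(
    describe=lambda q: f"scene: {q}",
    reasoners=[lambda d: d.upper(), lambda d: d.lower()],
    decide=lambda cands: max(cands, key=len),  # placeholder decision rule
)
print(pipeline.run("a partially occluded stop sign"))
```

The design choice illustrated here is separation of concerns: perception (description), inference (reasoning), and arbitration (decision) are decoupled, so any single agent can be swapped or strengthened without retraining the others.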