Large Vision-Language Models (LVLMs) have demonstrated remarkable success across a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks, such as object detection, semantic segmentation, and depth estimation, remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on the COCO 2017 validation set, struggling in particular with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple yet effective strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding, each of which aligns more naturally with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on the COCO 2017 validation set and achieves competitive results across a variety of perception benchmarks, outperforming baselines by 2% on the RefCOCO series and by 19% on Flickr30k Entities.
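The three-step decomposition described above can be pictured as a prompt chain: first ask the model which categories are present, then how many instances of each, and only then request one box per counted instance. The sketch below is a minimal illustration of that flow; the `ask` helper, the prompt wording, and the output parsing are assumptions for illustration, not the released CoT4Det implementation.

```python
# Illustrative sketch only: `ask` is a hypothetical stand-in for a single
# LVLM query (e.g., an instruction-tuned Qwen2.5-VL endpoint); the prompts
# and parsing are assumptions, not the paper's released pipeline.
from typing import Dict, List, Tuple


def ask(image_path: str, prompt: str) -> str:
    """Send one (image, prompt) pair to the LVLM and return its text reply."""
    raise NotImplementedError("Connect this to your LVLM inference API.")


def detect_with_cot(image_path: str) -> Dict[str, List[Tuple[int, ...]]]:
    # Step 1 (classification): which object categories appear in the image?
    reply = ask(image_path, "List every object category visible in this image, comma-separated.")
    categories = [c.strip() for c in reply.split(",") if c.strip()]

    detections: Dict[str, List[Tuple[int, ...]]] = {}
    for category in categories:
        # Step 2 (counting): how many instances of this category are there?
        count_reply = ask(
            image_path,
            f"How many instances of '{category}' are in the image? Answer with a single integer.",
        )
        count = int(count_reply.strip())

        # Step 3 (grounding): request one bounding box per counted instance.
        boxes_reply = ask(
            image_path,
            f"Give the bounding box x1,y1,x2,y2 for each of the {count} '{category}' instances, one per line.",
        )
        boxes = [
            tuple(int(v) for v in line.strip().split(","))
            for line in boxes_reply.splitlines()
            if line.strip()
        ]
        detections[category] = boxes
    return detections
```

In practice each step's output would likely need validation, for example re-querying when the reported count disagrees with the number of returned boxes, before scoring against COCO-style metrics.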

