GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Vision-language models (VLMs) hallucinate persistently on counting tasks, where their accuracy is substantially lower than on other visual reasoning tasks (sentiment excepted). The problem persists even in state-of-the-art reasoning-capable VLMs. In contrast, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy reaches 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6 pp improvement, while cutting inference time by 22%; for stronger models, the grounding eliminates hallucination-driven reasoning loops. Our ablation studies show that positional encoding is a critical factor: it helps stronger models but hurts weaker ones. Confidence scores, by contrast, introduce noise for most architectures; removing them improves performance in four of the five evaluated models. We further evaluate feature-level fusion architectures and find that explicit symbolic grounding via structured prompts outperforms implicit feature fusion, even when the latter uses sophisticated cross-attention mechanisms. Our approach yields consistent improvements of 6.2 to 7.5 pp on four of the five evaluated VLM architectures; the fifth degrades because its iterative reflection mechanisms are incompatible with structured prompts. These results suggest that counting failures stem from fundamental limitations in spatial-semantic integration rather than architecture-specific deficiencies, and they highlight the importance of architectural compatibility in augmentation strategies.
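To make the prompt-based augmentation concrete, the following is a minimal sketch of how detector output might be turned into a structured grounding prompt, with positional information and confidence scores as switchable components in the spirit of the ablations. The abstract does not specify the prompt format, so everything here is an assumption made for illustration: the `Detection` type, the 3x3 grid used for coarse positions, the function names, and the hard-coded detections (which stand in for the output of a detector such as YOLO).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    label: str            # class name reported by the detector
    confidence: float     # detector score in [0, 1]
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def coarse_position(box, image_width, image_height):
    """Map a box centre to a cell of an assumed 3x3 grid, e.g. 'bottom left'."""
    x_center = (box[0] + box[2]) / 2
    y_center = (box[1] + box[3]) / 2
    cols = ["left", "center", "right"]
    rows = ["top", "middle", "bottom"]
    col = cols[min(int(3 * x_center / image_width), 2)]
    row = rows[min(int(3 * y_center / image_height), 2)]
    return f"{row} {col}"


def build_grounded_prompt(question, detections, image_size,
                          include_position=True, include_confidence=False):
    """Prepend per-class counts (and optional per-instance details) to a question."""
    width, height = image_size
    counts = Counter(d.label for d in detections)
    lines = ["Detected objects (from an external object detector):"]
    for label, n in sorted(counts.items()):
        lines.append(f"- {label}: {n}")
    # Per-instance lines are emitted only if at least one detail is enabled.
    if include_position or include_confidence:
        lines.append("Instances:")
        for i, d in enumerate(detections, start=1):
            parts = [f"{i}. {d.label}"]
            if include_position:
                parts.append(f"at {coarse_position(d.box, width, height)}")
            if include_confidence:
                parts.append(f"(confidence {d.confidence:.2f})")
            lines.append(" ".join(parts))
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Made-up detections for a 640x480 image; a real pipeline would obtain these from an ODM.
detections = [
    Detection("dog", 0.91, (40, 300, 200, 460)),
    Detection("dog", 0.78, (420, 310, 600, 470)),
    Detection("ball", 0.66, (300, 200, 340, 240)),
]
prompt = build_grounded_prompt("How many dogs are in the image?", detections, (640, 480))
print(prompt)
```

The resulting string would be passed to the VLM together with the image. Toggling `include_position` and `include_confidence` gives prompt variants analogous to the ablation conditions described above, although the exact variants used in the study may differ.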
