Counting is a fundamental operation in many real-world visual tasks and requires both object recognition and robust counting ability. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting. In this work, we evaluate several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that although they make relatively few errors when the number of objects is small, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes a counting problem into sub-tasks and incorporates a mechanism that prevents objects from being split during division; such splits would otherwise cause the same object to be counted more than once, a common issue in naive divide-and-conquer implementations. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.
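The divide-and-conquer idea can be illustrated with a minimal sketch. The abstract does not specify how the split-avoidance mechanism works, so everything below is an assumption made for illustration: a binary object mask is taken as given, cuts are restricted to object-free columns, and `count_components` is a toy stand-in for querying an LVLM on an image crop. The names `safe_cut`, `count_divide_and_conquer`, and `naive_count` are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Optional

Mask = List[List[int]]  # 1 where a pixel belongs to an object, 0 for background


def count_components(mask: Mask, lo: int, hi: int) -> int:
    """Toy stand-in for an LVLM query: count 4-connected blobs in columns [lo, hi)."""
    seen = set()
    total = 0
    for r in range(len(mask)):
        for c in range(lo, hi):
            if mask[r][c] and (r, c) not in seen:
                total += 1
                seen.add((r, c))
                stack = [(r, c)]
                while stack:  # flood fill restricted to the crop
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        inside = 0 <= ny < len(mask) and lo <= nx < hi
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return total


def safe_cut(mask: Mask, target: int, lo: int, hi: int) -> Optional[int]:
    """Return the object-free column closest to `target` strictly inside (lo, hi)."""
    free = [x for x in range(lo + 1, hi) if all(row[x] == 0 for row in mask)]
    return min(free, key=lambda x: abs(x - target)) if free else None


def count_divide_and_conquer(
    mask: Mask,
    count_region: Callable[[int, int], int],
    lo: int = 0,
    hi: Optional[int] = None,
    max_width: int = 4,
) -> int:
    """Split [lo, hi) at object-free columns until pieces are narrow, then sum counts."""
    if hi is None:
        hi = len(mask[0])
    if hi - lo <= max_width:
        return count_region(lo, hi)
    cut = safe_cut(mask, (lo + hi) // 2, lo, hi)
    if cut is None:  # no cut avoids every object: count this piece whole
        return count_region(lo, hi)
    return (count_divide_and_conquer(mask, count_region, lo, cut, max_width)
            + count_divide_and_conquer(mask, count_region, cut, hi, max_width))


def naive_count(mask: Mask, count_region: Callable[[int, int], int]) -> int:
    """Naive variant: always cut at the midpoint, even through an object."""
    mid = len(mask[0]) // 2
    return count_region(0, mid) + count_region(mid, len(mask[0]))


# Three objects; the midpoint column (5) passes through the middle one.
MASK = [
    [1, 1, 0, 0, 1, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
]
counter = lambda lo, hi: count_components(MASK, lo, hi)
safe_total = count_divide_and_conquer(MASK, counter)
naive_total = naive_count(MASK, counter)
```

On this toy mask the naive midpoint cut splits the middle object and counts it in both halves, whereas restricting cuts to empty columns keeps every object inside exactly one sub-region. In a real setting the mask would have to come from some separate segmentation or detection step, which is outside what the abstract describes.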