Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations for every object. A few weakly-supervised methods use only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g., people. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing. First, a divide-and-discern dialogue tuning strategy guides the MLLM to determine whether the object count falls within a specific range and to progressively narrow that range through multi-round dialogue. Second, a compare-and-rank count optimization strategy trains the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at https://github.com/viscom-tongji/WS-COC.
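The range-narrowing idea behind the divide-and-discern strategy can be illustrated with a minimal sketch. This is not the WS-COC implementation: the abstract does not say how ranges are split, so the halving rule below is an assumption, and `narrow_count_range`, `in_range`, and `make_oracle` are hypothetical names. The MLLM's yes/no answer in each dialogue round is replaced by a toy oracle that knows the true count.

```python
from typing import Callable


def narrow_count_range(
    in_range: Callable[[int, int], bool],
    low: int,
    high: int,
) -> int:
    """Narrow [low, high] to a single count using yes/no range queries.

    `in_range(a, b)` is a hypothetical stand-in for one dialogue round that
    asks "is the object count between a and b (inclusive)?".
    Assumes the true count lies somewhere in [low, high].
    """
    while low < high:
        mid = (low + high) // 2
        if in_range(low, mid):
            high = mid       # count is in the lower half
        else:
            low = mid + 1    # count is in the upper half
    return low


def make_oracle(true_count: int) -> Callable[[int, int], bool]:
    """Toy replacement for the MLLM: answers range queries exactly."""
    return lambda a, b: a <= true_count <= b


estimate = narrow_count_range(make_oracle(37), 0, 1000)
print(estimate)
```

With a perfect oracle, halving needs roughly log2 of the range width in rounds; a real model can answer incorrectly, which this sketch does not handle.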
