Visual object counting has recently shifted towards class-agnostic counting (CAC), which addresses the challenge of counting objects across arbitrary categories, a crucial capability for flexible and generalizable counting systems. Humans effortlessly identify and count objects from diverse categories without prior knowledge. Most existing counting methods, by contrast, are restricted to enumerating instances of known classes: they require extensive labeled datasets for training and struggle in open-vocabulary settings. CAC instead aims to count objects belonging to classes never seen during training, operating in a few-shot setting. In this paper, we present the first comprehensive review of CAC methodologies. We propose a taxonomy that categorizes CAC approaches into three paradigms according to how the target object class is specified: reference-based, reference-less, and open-world text-guided. Reference-based approaches achieve state-of-the-art performance by relying on exemplar-guided mechanisms. Reference-less methods eliminate exemplar dependency by leveraging inherent image patterns. Finally, open-world text-guided methods use vision-language models, allowing the object class to be described through textual prompts and offering a flexible and promising solution. Based on this taxonomy, we provide an overview of 30 CAC architectures and report their performance on gold-standard benchmarks, discussing key strengths and limitations. Specifically, we present results on the FSC-147 dataset, establishing a leaderboard with standard evaluation metrics, and on the CARPK dataset to assess generalization capabilities. Finally, we offer a critical discussion of persistent challenges, such as annotation dependency and generalization, alongside future directions.