Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and proposals that frequently span multiple instances. Other methods sample random image patches, which fails to delineate object instances accurately. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets show that CountZES achieves superior performance among zero-shot object counting (ZOC) methods and generalizes effectively across domains.