Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and multi-instance proposals. Other methods sample image patches at random, which fails to delineate object instances accurately. Since counting is sensitive to exemplar quality, such selection strategies often yield poorly representative exemplars, leading to inaccurate count estimates. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among zero-shot object counting (ZOC) methods, as well as effective generalization across domains.
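To make the idea of feature consensus concrete, the following is a minimal sketch and not the CountZES implementation. It assumes each candidate exemplar has already been reduced to a feature vector, and it swaps in a simple medoid-style rule (pick the candidate with the highest mean cosine similarity to the others) as a stand-in for the feature-space clustering used by FCE. The names `cosine` and `consensus_exemplar` and the toy feature vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consensus_exemplar(features):
    """Return the index of the candidate most similar, on average, to the rest.

    Hypothetical stand-in for feature-consensus selection: a medoid under
    cosine similarity, not the clustering procedure of the paper.
    """
    if not features:
        raise ValueError("need at least one candidate")
    if len(features) == 1:
        return 0
    best_idx, best_score = 0, -math.inf
    for i, fi in enumerate(features):
        # Mean similarity of candidate i to every other candidate.
        total = sum(cosine(fi, fj) for j, fj in enumerate(features) if j != i)
        score = total / (len(features) - 1)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Toy candidates: three similar features and one outlier (assumed data).
candidates = [
    [1.0, 0.2],
    [1.0, 0.0],
    [1.0, -0.2],
    [-1.0, 0.0],  # outlier pointing the opposite way
]
print(consensus_exemplar(candidates))
```

In this toy setting the selected candidate is the one that sits in the middle of the three similar features, while the outlier scores lowest. A clustering-based variant would first group candidates and then choose a representative per cluster.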