Recent advances in vision-language models have shown remarkable zero-shot text-image matching ability that transfers to downstream tasks such as object detection and segmentation. Adapting these models to object counting, however, remains a formidable challenge. In this study, we investigate transferring vision-language models (VLMs) to class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. By fully exploiting the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects of interest. Extensive experiments on the FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate the state-of-the-art accuracy and generalizability of the proposed method. Code is available at https://github.com/songrise/CLIP-Count.
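To make the idea of a patch-text contrastive loss more concrete, the following is a minimal sketch and not the CLIP-Count implementation. It assumes an InfoNCE-style formulation in which patches that contain the queried object (marked by a hypothetical `pos_mask`, which in practice might be derived from the ground-truth density map) are pulled toward the text embedding while the remaining patches are pushed away. The function name, the mask, and the temperature value of 0.07 are all illustrative assumptions; the abstract does not specify the exact form of the loss.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def patch_text_contrastive_loss(
    patch_embs: Sequence[Sequence[float]],
    text_emb: Sequence[float],
    pos_mask: Sequence[bool],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss between one text embedding and N patch embeddings.

    loss = -log( sum_{i in positives} exp(s_i / t) / sum_{all i} exp(s_i / t) )
    where s_i is the cosine similarity between patch i and the text.
    """
    if len(patch_embs) != len(pos_mask):
        raise ValueError("patch_embs and pos_mask must have the same length")
    if not any(pos_mask):
        raise ValueError("at least one positive patch is required")

    text = _normalize(text_emb)
    # Cosine similarity of every patch with the text, scaled by temperature.
    logits = [
        sum(p * t for p, t in zip(_normalize(patch), text)) / temperature
        for patch in patch_embs
    ]
    pos_logits = [l for l, is_pos in zip(logits, pos_mask) if is_pos]
    return _logsumexp(logits) - _logsumexp(pos_logits)


if __name__ == "__main__":
    # Toy 2-D embeddings: two patches point along the text direction,
    # two point away from it.
    text = [1.0, 0.0]
    patches = [[2.0, 0.0], [3.0, 0.0], [-1.0, 0.0], [-4.0, 0.0]]
    aligned = patch_text_contrastive_loss(patches, text, [True, True, False, False])
    misaligned = patch_text_contrastive_loss(patches, text, [False, False, True, True])
    print(f"aligned mask loss:    {aligned:.6f}")
    print(f"misaligned mask loss: {misaligned:.6f}")
```

In this toy setup the loss is close to zero when the positive patches are the ones aligned with the text, and large when the mask is flipped, which is the behavior such a loss relies on to shape patch-level features.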