Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that transfers to downstream tasks such as object detection and segmentation. However, adapting these models for object counting, which involves estimating the number of objects in an image, remains a formidable challenge. In this study, we present the first exploration of transferring visual-language models to class-agnostic object counting. Specifically, we propose CLIP-Count, a novel pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner, without requiring any finetuning on specific object classes. To align the text embedding with dense image features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level image representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module that propagates semantic information across different resolution levels of image features. By fully exploiting the rich image-text alignment knowledge of pretrained visual-language models, our method generates high-quality density maps for objects of interest. Extensive experiments on the FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate that our proposed method achieves state-of-the-art accuracy and generalizability for zero-shot object counting. Project page: https://github.com/songrise/CLIP-Count
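The two ideas the abstract names, a patch-text contrastive loss and counting by density map, can be illustrated with a minimal sketch. This is not the CLIP-Count implementation: the loss below is a generic InfoNCE-style objective, and the function names, the `positive_mask` input (marking patches assumed to contain the queried object), and the temperature value of 0.07 are all assumptions made for illustration. The count is recovered by summing the density map, which is the standard convention in density-based counting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_text_contrastive_loss(patch_embeds, text_embed, positive_mask,
                                temperature=0.07):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact form).

    Patches flagged in `positive_mask` are pulled toward the text embedding;
    all remaining patches act as negatives.
    """
    if not any(positive_mask):
        raise ValueError("at least one positive patch is required")
    logits = [cosine(p, text_embed) / temperature for p in patch_embeds]
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    positive_sum = sum(e for e, is_pos in zip(exps, positive_mask) if is_pos)
    return -math.log(positive_sum / sum(exps))


def count_from_density(density_map):
    """Estimated object count is the sum over all density-map cells."""
    return sum(sum(row) for row in density_map)


# Toy example with 2-D embeddings (hypothetical values).
text = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0]]
aligned = patch_text_contrastive_loss(patches, text, [True, False])
misaligned = patch_text_contrastive_loss(patches, text, [False, True])
estimated_count = count_from_density([[0.5, 0.5], [1.0, 0.0]])
```

In this toy case the loss is close to zero when the positive patch already matches the text embedding and large when the positive patch is orthogonal to it, which is the behavior a contrastive objective of this kind is meant to encourage.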