CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that transfers to downstream tasks such as object detection and segmentation. However, adapting these models to object counting, which involves estimating the number of objects in an image, remains a formidable challenge. In this study, we conduct the first exploration of transferring visual-language models to class-agnostic object counting. Specifically, we propose CLIP-Count, a novel pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner, without requiring any fine-tuning on specific object classes. To align the text embedding with dense image features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level image representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module that propagates semantic information across different resolution levels of image features. By fully exploiting the rich image-text alignment knowledge of pretrained visual-language models, our method generates high-quality density maps for objects of interest. Extensive experiments on the FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate that our proposed method achieves state-of-the-art accuracy and generalizability for zero-shot object counting. Project page at https://github.com/songrise/CLIP-Count
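
To make the two central ideas in the abstract concrete, the following is a minimal sketch and not the released implementation. It assumes an InfoNCE-style form for the patch-text contrastive loss, in which patches marked as containing the object are pulled toward the text embedding and the remaining patches are pushed away; the abstract does not give the exact formula. It also relies on the standard convention that the predicted count is the sum of the density map. The function names, the `is_positive` mask, and the temperature of 0.07 are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_text_contrastive_loss(patches, text, is_positive, temperature=0.07):
    """InfoNCE-style patch-text loss (assumed form, not taken from the paper).

    patches:     list of patch embedding vectors
    text:        text embedding vector
    is_positive: list of bools, True where the patch is assumed to contain
                 the object named by the text
    """
    if not any(is_positive):
        raise ValueError("at least one positive patch is required")
    logits = [cosine(p, text) / temperature for p in patches]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    pos = sum(w for w, flag in zip(weights, is_positive) if flag)
    total = sum(weights)
    # Small when positive patches dominate the similarity mass.
    return -math.log(pos / total)


def count_from_density(density_map):
    """Predicted object count is the sum over all density-map cells."""
    return sum(sum(row) for row in density_map)


if __name__ == "__main__":
    text = [1.0, 0.0]
    patches = [[1.0, 0.0], [0.0, 1.0]]  # first patch matches the text
    print(patch_text_contrastive_loss(patches, text, [True, False]))
    print(patch_text_contrastive_loss(patches, text, [False, True]))
    print(count_from_density([[0.5, 0.5], [1.0, 0.0]]))
```

In this toy setting the loss is near zero when the patch aligned with the text is the one labelled positive, and large when the labels are swapped, which is the behaviour a patch-level alignment objective is meant to encourage.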

