VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting

Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To address ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars, then counting. However, the sequential design of this two-stage process leaves it vulnerable to error propagation. This work proposes a one-stage baseline, Visual-Language Baseline (VLBase), which explores the implicit association between the semantic and patch embeddings of CLIP. VLBase is then extended to Visual-language Counter (VLCounter) by incorporating three modules devised to tailor it for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to adapt the semantic-patch similarity map to the counting task. Lastly, the layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connection (SaSC) to preserve the generalization capability for unseen classes. Extensive experiments on FSC147, CARPK, and PUCPR+ demonstrate the benefits of the end-to-end framework, VLCounter.
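
To make the semantic-patch similarity map and its affine adjustment concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact form of LAT, so a single scalar `scale` and `shift` stand in for the learned parameters, the embeddings are toy 2-D vectors rather than CLIP outputs, and the function names `similarity_map` and `affine` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_map(text_embedding, patch_embeddings):
    """One similarity score per image patch (patches given as a flat list)."""
    return [cosine_similarity(text_embedding, p) for p in patch_embeddings]


def affine(sim_map, scale, shift):
    """Element-wise affine map. Assumption: scalar scale and shift stand in
    for learned parameters; the paper's LAT may be parameterized differently."""
    return [scale * s + shift for s in sim_map]


# Toy 2-D embeddings (hypothetical values, not CLIP outputs).
text = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

sims = similarity_map(text, patches)           # similarity of each patch to the text
adjusted = affine(sims, scale=2.0, shift=1.0)  # rescaled map passed on for counting

print(sims)
print(adjusted)
```

In a one-stage design of this kind, the adjusted map would be fed to a decoder that regresses a density map, so no separate exemplar-discovery stage is needed.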

