Class-agnostic counting (CAC) aims to count objects of interest in a query image given only a few exemplars. This task is typically addressed by extracting features from the query image and the exemplars separately, with shared or unshared feature extractors, and then matching their feature similarity, leading to an extract-\textit{then}-match paradigm. In this work, we show that CAC can be simplified into an extract-\textit{and}-match paradigm, particularly with a pretrained, plain vision transformer (ViT), where feature extraction and similarity matching are executed simultaneously within self-attention. We explain the rationale for this simplification from a decoupled view of self-attention and point out that the simplification is possible only if the query and exemplar tokens are concatenated as input. The resulting model, termed CACViT, simplifies the CAC pipeline and unifies the feature spaces of the query image and the exemplars. In addition, we find that CACViT naturally encodes background information within self-attention, which helps reduce background disturbance. Further, to compensate for the loss of scale and order-of-magnitude information caused by resizing and normalization in ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests that CACViT provides a concise and strong baseline for CAC. Code will be available.
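The decoupled view of self-attention mentioned above can be illustrated with a minimal sketch. It is not the CACViT implementation: it assumes a single attention head, random weights, and hypothetical token counts, and all function names are illustrative. When query and exemplar tokens are concatenated, the attention map splits into four blocks. One way to read them is that the within-image blocks act as feature extraction, while the cross blocks compare query tokens with exemplar tokens, which is the matching step.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated [query; exemplar] tokens."""
    tokens = query_tokens + exemplar_tokens  # concatenate along the sequence axis
    head_dim = len(w_q[0])
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(head_dim)
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    attn = softmax_rows(scores)
    return attn, matmul(attn, v)


def split_attention(attn, num_query):
    """Split the joint attention map into its four query/exemplar blocks."""
    top, bottom = attn[:num_query], attn[num_query:]
    return {
        "query_to_query": [row[:num_query] for row in top],
        "query_to_exemplar": [row[num_query:] for row in top],
        "exemplar_to_query": [row[:num_query] for row in bottom],
        "exemplar_to_exemplar": [row[num_query:] for row in bottom],
    }


def random_matrix(rows, cols):
    """Gaussian random matrix standing in for real tokens or learned weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes chosen only for illustration.
random.seed(0)
DIM, NUM_QUERY, NUM_EXEMPLAR = 8, 6, 3

query_tokens = random_matrix(NUM_QUERY, DIM)
exemplar_tokens = random_matrix(NUM_EXEMPLAR, DIM)
w_q, w_k, w_v = (random_matrix(DIM, DIM) for _ in range(3))

attn, output = joint_self_attention(query_tokens, exemplar_tokens, w_q, w_k, w_v)
blocks = split_attention(attn, NUM_QUERY)
```

Because the softmax is taken over the full concatenated sequence, each query token distributes its attention across both query and exemplar tokens in one pass. Without concatenation, the `query_to_exemplar` and `exemplar_to_query` blocks would not exist, which is consistent with the statement that the simplification depends on concatenated input.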