In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches, or between image patches and the given "exemplars", with the attention mechanism; (2) We adopt a two-stage training regime that first pre-trains the model with self-supervised learning and then fine-tunes it with supervision; (3) We propose a simple, scalable pipeline for synthesizing training images that contain a large number of instances or instances from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) We conduct thorough ablation studies on a large-scale counting benchmark, e.g. FSC-147, and demonstrate state-of-the-art performance in both zero-shot and few-shot settings.
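
To make contribution (1) concrete, the following is a minimal sketch of how an attention mechanism can score the similarity between image-patch features and exemplar features. It is not the authors' implementation: the function names, the toy feature vectors, and the choice of patches as queries and exemplars as keys and values are assumptions made for illustration, and the learned projections and multiple heads of a real transformer layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(patches, exemplars):
    """Attend from each image-patch feature to the exemplar features.

    patches:   list of d-dimensional feature vectors (used as queries)
    exemplars: list of d-dimensional feature vectors (used as keys and values)
    Returns (weights, outputs): the per-patch attention weights over the
    exemplars, and the attention-weighted mix of exemplar features per patch.
    """
    dim = len(exemplars[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    weights, outputs = [], []
    for query in patches:
        # Scaled dot product between this patch and every exemplar.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in exemplars]
        attn = softmax(scores)
        # Weighted sum of exemplar features, one output dimension at a time.
        mixed = [sum(a * value[i] for a, value in zip(attn, exemplars)) for i in range(dim)]
        weights.append(attn)
        outputs.append(mixed)
    return weights, outputs


# Hypothetical toy features: two patches and two exemplars, d = 2.
patches = [[1.0, 0.0], [0.0, 1.0]]
exemplars = [[4.0, 0.0], [0.0, 4.0]]
weights, outputs = cross_attention(patches, exemplars)
```

In this toy input each patch places most of its attention on the exemplar whose feature vector points in the same direction, which is the sense in which attention weights act as a similarity measure. How a zero-shot setting (no exemplars) would be handled is not covered by this sketch.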