Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and with the real-time inference requirements of counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves both good counting accuracy and high computational efficiency. RT-Counter introduces a novel Visual Prototype Textualization (VPT) module that projects learned visual features into a text feature space. The resulting features combine abstract information that is hard to capture with visual prototypes and detailed prototype information that is difficult to describe in text, which enhances the counting capabilities of the object-level vision-language model. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, which maintain high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that efficiently weaves together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4$\times$ faster and over 4$\times$ more parameter-efficient than existing leading methods in TOOC. Our work aims to balance high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.