The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and extend its capabilities by introducing modules that enable the target object to be specified by visual exemplars. In turn, these new capabilities, namely being able to specify the target object by multiple modalities (text and exemplars), lead to an improvement in counting accuracy. We make three contributions: first, we introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description, visual exemplars, or both; second, we show that the model significantly improves the state of the art on multiple counting benchmarks: when using text only, CountGD is comparable to or outperforms all previous text-only works, and when using both text and visual exemplars, it outperforms all previous models; third, we carry out a preliminary study of the different interactions between the text and visual exemplar prompts, including the cases where they reinforce each other and where one restricts the other. The code and an app for testing the model are available at https://www.robots.ox.ac.uk/~vgg/research/countgd/.