The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and extend its capabilities by introducing modules that enable the target object to be specified by visual exemplars. In turn, these new capabilities, namely being able to specify the target object by multiple modalities (text and exemplars), lead to an improvement in counting accuracy. We make three contributions: first, we introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description, visual exemplars, or both; second, we show that the model significantly improves the state of the art on multiple counting benchmarks: when using text only, CountGD is comparable to or outperforms all previous text-only works, and when using both text and visual exemplars, it outperforms all previous models; third, we carry out a preliminary study of the different interactions between the text and visual exemplar prompts, including the cases where they reinforce each other and where one restricts the other. The code and an app for testing the model are available at https://www.robots.ox.ac.uk/~vgg/research/countgd/.