The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of "pseudo-exemplars" that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.