How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets only test generalization over object types and referral expressions, and so do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously constructed to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models fail on all of the new tasks except simple object types, demonstrating the value of the proposed dataset in pinpointing the weaknesses of current OVD models and guiding future research. Furthermore, experiments verify that the proposed NMS-AP metric provides a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at \url{https://github.com/om-ai-lab/OVDEval}.
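
The abstract names NMS-AP but does not describe how it is computed. As a rough illustration only, the sketch below assumes that the metric first applies class-agnostic non-maximum suppression across the predictions for all labels, so that one region cannot be credited under several competing fine-grained labels, and then computes standard AP on the boxes that survive. The function names, the tuple layout, and the 0.5 IoU threshold are hypothetical choices made for this example and are not taken from the paper or the OVDEval repository.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Pred = Tuple[Box, float, str]             # (box, confidence score, predicted label)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(preds: List[Pred], iou_thr: float = 0.5) -> List[Pred]:
    """Greedy NMS that ignores labels (assumed pre-processing step).

    Predictions are visited in descending score order; a prediction is kept
    only if it does not overlap an already kept box by more than iou_thr,
    regardless of which label either box carries.
    """
    kept: List[Pred] = []
    for pred in sorted(preds, key=lambda p: p[1], reverse=True):
        if all(iou(pred[0], k[0]) <= iou_thr for k in kept):
            kept.append(pred)
    return kept


# A detector that outputs nearly the same box for two competing labels.
preds = [
    ((10.0, 10.0, 50.0, 50.0), 0.9, "red cup"),
    ((11.0, 10.0, 50.0, 51.0), 0.8, "blue cup"),   # overlaps the first box
    ((100.0, 100.0, 150.0, 150.0), 0.7, "red cup"),
]
survivors = class_agnostic_nms(preds)
```

Under this assumption, the overlapping "blue cup" prediction is suppressed before scoring, and the surviving predictions would then be passed to an ordinary per-label AP computation.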

