Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, and so do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously constructed to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric, Non-Maximum Suppression Average Precision (NMS-AP), to address it. Extensive experimental results show that existing top OVD models fail on all of the new tasks except simple object types, demonstrating the value of the proposed dataset in pinpointing the weaknesses of current OVD models and guiding future research. Furthermore, experiments verify that the proposed NMS-AP metric provides a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at \url{https://github.com/om-ai-lab/OVDEval}.
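The abstract does not spell out how NMS-AP is computed. As a rough illustration only, the sketch below assumes one plausible reading of the name: a class-agnostic non-maximum suppression pass is applied to a model's predictions before AP is computed, so that a single box cannot be counted under several competing fine-grained labels (for example "red car" and "blue car"). The function names, the `(box, label, score)` tuple layout, and the 0.5 IoU threshold are hypothetical choices for this example and are not taken from the OVDEval code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
Pred = Tuple[Box, str, float]            # (box, label, score), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(preds: List[Pred], iou_thr: float = 0.5) -> List[Pred]:
    """Greedy NMS that ignores labels (assumed pre-processing step).

    Predictions are visited in descending score order; a prediction is kept
    only if it does not overlap any already-kept box by more than iou_thr,
    regardless of which label either box carries.
    """
    kept: List[Pred] = []
    for pred in sorted(preds, key=lambda p: p[2], reverse=True):
        if all(iou(pred[0], k[0]) <= iou_thr for k in kept):
            kept.append(pred)
    return kept


# A detector that hedges by emitting the same box under two competing labels.
preds: List[Pred] = [
    ((10.0, 10.0, 50.0, 50.0), "red car", 0.60),
    ((10.0, 10.0, 50.0, 50.0), "blue car", 0.55),
    ((100.0, 100.0, 150.0, 150.0), "red car", 0.40),
]

kept = class_agnostic_nms(preds)
for box, label, score in kept:
    print(label, score, box)
```

In this toy input, the duplicate box labelled "blue car" is suppressed because it fully overlaps the higher-scoring "red car" box, so only one label per location survives. Under this assumed reading, any standard AP routine would then be run on `kept` instead of on the raw predictions.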