Recent advancements in large vision-language models have enabled visual object detection in open-vocabulary scenarios, where object classes are defined as free text at inference time. In this paper, we probe state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation that tests whether models can detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute a benchmark suite of increasing difficulty that probes different properties such as color, pattern, and material. We then evaluate several state-of-the-art open-vocabulary object detectors with the proposed protocol and find that most existing solutions, which shine on standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://github.com/lorebianchi98/FG-OVD.