Are vision-language models (VLMs) for open-vocabulary perception inherently open-set models because they are trained on internet-scale datasets? We answer this question with a clear no: VLMs introduce closed-set assumptions via their finite query set, making them vulnerable to open-set conditions. We systematically evaluate VLMs for open-set recognition and find that they frequently misclassify objects not contained in their query set, leading to alarmingly low precision when tuned for high recall and vice versa. We show that naively increasing the size of the query set to cover more and more classes does not mitigate this problem, but instead degrades both task performance and open-set performance. We establish a revised definition of the open-set problem for the age of VLMs, define a new benchmark and evaluation protocol to facilitate standardised evaluation and research in this important area, and evaluate promising baseline approaches based on predictive uncertainty and dedicated negative embeddings on a range of open-vocabulary VLM classifiers and object detectors.
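To make the closed-set assumption concrete, the following is a minimal sketch (not the paper's benchmark code) of a standard CLIP-style open-vocabulary classifier. The checkpoint name, the example query set, and the file `unknown_object.jpg` are illustrative assumptions; the point is that the softmax is taken only over the finite query set, so an object belonging to none of the queries is still forced onto an in-set label.

```python
# Sketch: CLIP zero-shot classification over a finite query set.
# An out-of-set object is still assigned to the most similar query,
# often with high confidence - the closed-set assumption in practice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Finite query set: the only labels the model can ever output.
query_set = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Hypothetical image of an object not covered by any query.
image = Image.open("unknown_object.jpg")
inputs = processor(text=query_set, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Probabilities are normalised over the query set only, so they sum to 1
# even when the true class is absent from the queries.
probs = outputs.logits_per_image.softmax(dim=-1)
pred = query_set[probs.argmax(dim=-1).item()]
print(f"predicted: {pred}  (confidence {probs.max().item():.2f})")
```

Baselines of the kind evaluated in the paper (predictive uncertainty thresholds, dedicated negative embeddings) would intervene at the scoring step above, e.g. by rejecting predictions whose maximum similarity or probability falls below a threshold rather than always returning an in-set label.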