Are vision-language models (VLMs) open-set models because they are trained on internet-scale datasets? We answer this question with a clear no: VLMs introduce closed-set assumptions via their finite query set, which leaves them vulnerable to open-set conditions. We systematically evaluate VLMs for open-set recognition and find that they frequently misclassify objects not contained in their query set, leading to alarmingly low precision when tuned for high recall, and vice versa. We show that naively enlarging the query set to contain more and more classes does not mitigate this problem, but instead degrades both task performance and open-set performance. We establish a revised definition of the open-set problem for the age of VLMs, define a new benchmark and evaluation protocol to facilitate standardised evaluation and research in this important area, and evaluate promising baseline approaches based on predictive uncertainty and dedicated negative embeddings on a range of VLM classifiers and object detectors.
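To make the closed-set assumption concrete, the following is a minimal sketch of zero-shot classification over a finite query set, with an optional rejection step based on predictive uncertainty. It is not the evaluation protocol or the exact baseline from this work: the maximum-softmax-probability threshold is one common uncertainty baseline, assumed here for illustration. The function name `classify_open_set`, the `temperature` and `threshold` values, and the toy three-dimensional embeddings are all hypothetical stand-ins for real VLM image and text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_set(image_emb, query_embs, labels,
                      temperature=0.01, threshold=0.9):
    """Pick the best-matching query label, or "unknown" if confidence is low.

    temperature and threshold are illustrative values (assumptions),
    not settings taken from the source.
    """
    sims = [cosine(image_emb, q) for q in query_embs]
    # The softmax normalises over the finite query set only, so the
    # probabilities always sum to 1 across the listed classes.
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return labels[best], probs[best]


# Toy stand-ins for text embeddings of the query set {"cat", "dog"}.
labels = ["cat", "dog"]
query_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

cat_like = [0.9, 0.1, 0.0]   # close to the "cat" query
novel = [0.5, 0.5, 0.7]      # equally far from both queries

print(classify_open_set(cat_like, query_embs, labels))
print(classify_open_set(novel, query_embs, labels))
# With threshold=0.0 the classifier is forced to pick a query-set label.
print(classify_open_set(novel, query_embs, labels, threshold=0.0))
```

With the threshold disabled, the novel input is necessarily assigned one of the query-set labels, which is the failure mode the abstract describes. Raising the threshold rejects more unknown inputs but also rejects more correct predictions, which is the precision and recall trade-off mentioned above.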