Existing Vision-Language Models (VLMs) have achieved significant progress by training on massive-scale datasets, typically under the assumption that the data are independent and identically distributed (IID). In real-world scenarios, however, it is rarely practical to expect every input an AI system processes to satisfy this assumption. Moreover, failing to handle out-of-distribution (OOD) objects appropriately can introduce safety risks in deployed applications such as autonomous driving or medical assistance. Unfortunately, current research has not yet provided a valid benchmark that comprehensively assesses how VLMs respond to OOD data. We therefore propose OODBench, a predominantly automated pipeline with minimal human verification for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench even when the underlying image categories are common. In addition, we propose a reliable automated evaluation metric that employs a Basic-to-Advanced Progression of prompted questions to more fully assess the impact of OOD data across questions of varying difficulty. Finally, we summarize substantive findings and insights to facilitate future research on the acquisition and evaluation of OOD data.