We propose an automated algorithm to stress-test a trained visual model by generating language-guided counterfactual test images (LANCE). Our method leverages recent progress in large language modeling and text-based image editing to augment an IID test set with a suite of diverse, realistic, and challenging test images without altering model weights. We benchmark the performance of a diverse set of pretrained models on our generated data and observe significant and consistent performance drops. We further analyze model sensitivity across different types of edits, and demonstrate that LANCE can surface previously unknown class-level model biases in ImageNet.