Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models

Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. However, their complexity and size make them poorly understood. While probing-based methods are widely used to understand specific properties, the structure of the representation space has not been systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond the datasets. In this paper, using a new gradient descent optimization method, we explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We obtain similar results with a different model, which supports that our findings apply to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.
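To make the idea of exploring an embedding space by gradient descent concrete, here is a minimal toy sketch. It is not the paper's method or model: the encoder is assumed to be a random linear map `W` (a stand-in loosely in the spirit of the linear approximation mentioned above), and the names `embed`, `match_embedding`, `x_source`, and `x_target`, as well as the dimensions, step size, and step count, are all hypothetical. The sketch optimizes an input so that its embedding matches that of a different input, then compares how far the input itself had to move.

```python
import numpy as np

def embed(W, x):
    """Toy linear 'encoder' (an assumption): maps an input vector to an embedding."""
    return W @ x

def match_embedding(W, x_start, target_emb, lr=0.1, steps=2000):
    """Gradient descent on 0.5 * ||embed(x) - target_emb||^2 over the input x."""
    x = x_start.copy()
    for _ in range(steps):
        residual = embed(W, x) - target_emb   # error measured in embedding space
        x -= lr * (W.T @ residual)            # gradient of the loss with respect to x
    return x

rng = np.random.default_rng(0)
in_dim, emb_dim = 64, 16                      # input space is larger than embedding space
W = rng.normal(size=(emb_dim, in_dim)) / np.sqrt(in_dim)

x_source = rng.normal(size=in_dim)            # stands in for the original image
x_target = rng.normal(size=in_dim)            # stands in for an image of another class

x_mod = match_embedding(W, x_source, embed(W, x_target))

emb_gap = np.linalg.norm(embed(W, x_mod) - embed(W, x_target))
moved = np.linalg.norm(x_mod - x_source)
apart = np.linalg.norm(x_target - x_source)
print(f"embedding gap: {emb_gap:.2e}, input moved: {moved:.2f}, "
      f"source-target distance: {apart:.2f}")
```

In this toy setting the modified input ends up with essentially the target's embedding while staying closer to the source than the target is, because the encoder discards the directions in which the two inputs still differ. Whether and how strongly this carries over to a real vision-language transformer is what the paper itself investigates; the sketch only illustrates the mechanics.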

