Vision Transformer (ViT) is an attention-based neural network architecture that has been shown to be effective for computer vision tasks. However, compared with a ResNet-18 of similar parameter count, ViT achieves significantly lower evaluation accuracy when trained on small datasets. To facilitate further study of this gap, we provide visual intuition for why it arises. We first compare the performance of the two models and confirm that ViT is less accurate than ResNet-18 when trained on small datasets. We then interpret the results using attention map visualizations for ViT and feature map visualizations for ResNet-18. We further analyze the difference from a representation similarity perspective. We conclude that the representations of a ViT trained on small datasets differ substantially from those of a ViT trained on large datasets, which may explain the sharp drop in performance on small datasets.
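As a rough illustration of what a representation similarity comparison can look like, the sketch below computes linear centered kernel alignment (CKA) between two sets of layer activations. The abstract does not name the similarity measure used, so linear CKA is an assumption here, and the function name `linear_cka` and the random stand-in activations are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    x has shape (n_samples, d1) and y has shape (n_samples, d2); row i of
    each matrix must come from the same input example. Assumed metric,
    not necessarily the one used in the paper.
    """
    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for activations of one layer from two differently trained
    # models, evaluated on the same 64 inputs (random data, illustration only).
    acts_a = rng.normal(size=(64, 16))
    acts_b = rng.normal(size=(64, 8))
    print("CKA(a, a):", linear_cka(acts_a, acts_a))
    print("CKA(a, b):", linear_cka(acts_a, acts_b))
```

In practice, the two matrices would be flattened activations of matching layers from the models being compared, collected on a shared evaluation batch. A score near 1 indicates similar representations up to rotation and isotropic scaling, and lower scores indicate divergence.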