Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) across a variety of image-based tasks. While CNNs have previously been evaluated on graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs on elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Following their experimental design, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance on general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perception modeling.