Since their inception, Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) across a wide spectrum of tasks. ViTs exhibit notable characteristics, including global attention, resilience against occlusions, and adaptability to distribution shifts. One underexplored aspect of ViTs is their potential for multi-attribute learning, i.e., their ability to learn multiple attribute-related tasks simultaneously. In this paper, we delve into the multi-attribute learning capability of ViTs, presenting a straightforward yet effective strategy for training multiple attributes as distinct tasks within a single ViT network. We assess the resilience of multi-attribute ViTs against adversarial attacks and compare their performance with that of ViTs trained on single attributes. We further evaluate the robustness of multi-attribute ViTs against a recent Transformer-based attack called Patch-Fool. Our empirical findings on the CelebA dataset support our claims.
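The strategy of training multiple attributes as distinct tasks within a single network can be illustrated by a shared backbone feeding one lightweight binary head per attribute, with the per-attribute losses summed into a single objective. The following is a minimal NumPy sketch, not the paper's implementation: the ViT backbone is stood in by a random feature vector, and the dimensions, head structure, and loss weighting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the shared ViT [CLS] embedding (a real backbone would
# produce this from an input image); d = 768 matches ViT-Base.
d, num_attrs = 768, 40          # CelebA annotates 40 binary attributes
features = rng.standard_normal(d)

# One lightweight binary head per attribute; all heads share the backbone.
heads = [rng.standard_normal(d) * 0.01 for _ in range(num_attrs)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each attribute is its own task: an independent sigmoid output with its
# own binary cross-entropy term.
probs = np.array([sigmoid(w @ features) for w in heads])
labels = rng.integers(0, 2, size=num_attrs)

# The multi-task objective is the mean of the per-attribute BCE losses.
eps = 1e-12
bce = -(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))
total_loss = bce.mean()
print(probs.shape, float(total_loss))
```

Because every head reads the same backbone features, gradients from all attribute losses update a single shared representation, which is what distinguishes this setup from training one ViT per attribute.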