Since their inception, Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) across a wide spectrum of tasks. ViTs exhibit notable characteristics, including global attention, resilience against occlusions, and adaptability to distribution shifts. One underexplored aspect of ViTs is their potential for multi-attribute learning, i.e., their ability to learn multiple attribute-related tasks simultaneously. In this paper, we delve into the multi-attribute learning capability of ViTs, presenting a straightforward yet effective strategy for training multiple attributes as distinct tasks through a single ViT network. We assess the resilience of multi-attribute ViTs against adversarial attacks and compare their performance with that of ViTs trained for single attributes. Moreover, we evaluate the robustness of multi-attribute ViTs against a recent Transformer-based attack called Patch-Fool. Our empirical findings on the CelebA dataset validate our claims. Our code is available at https://github.com/hananshafi/MTL-ViT.
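The strategy described above — a single shared ViT backbone with one classification head per attribute — can be sketched roughly as follows. This is a minimal illustrative PyTorch implementation, not the paper's exact architecture or configuration; all dimensions, the number of attributes, and the use of a shared [CLS] token feeding per-attribute linear heads are assumptions for demonstration.

```python
import torch
import torch.nn as nn


class MultiAttributeViT(nn.Module):
    """A toy ViT with a shared encoder and one classification head per
    attribute. Sizes and depth are illustrative, not the paper's config."""

    def __init__(self, num_attributes=3, embed_dim=64, depth=2, num_heads=4,
                 image_size=32, patch_size=8, num_classes_per_attr=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding via a strided convolution, as in standard ViTs.
        self.patch_embed = nn.Conv2d(3, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # One lightweight head per attribute; the backbone is fully shared.
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, num_classes_per_attr)
            for _ in range(num_attributes))

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tokens], dim=1) + self.pos_embed)
        cls_out = z[:, 0]  # shared [CLS] representation
        # One logit tensor per attribute task.
        return [head(cls_out) for head in self.heads]


# Usage: each attribute contributes its own cross-entropy term,
# and the total loss is their sum.
model = MultiAttributeViT()
outputs = model(torch.randn(2, 3, 32, 32))  # list of 3 tensors, each (2, 2)
```

Training then reduces to summing per-attribute cross-entropy losses over the shared backbone, so all attribute tasks are optimized jointly through one network.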