Vision transformers have contributed greatly to advancements in the computer vision domain, demonstrating state-of-the-art performance on diverse tasks (e.g., image classification, object detection). However, their computational cost grows quadratically with the number of tokens processed. Token sparsification mechanisms have been proposed to address this issue: they employ an input-dependent strategy in which uninformative tokens are discarded from the computation pipeline, improving the model's efficiency. However, their dynamism and average-case assumptions make them vulnerable to a new threat vector: carefully crafted adversarial examples that fool the sparsification mechanism, resulting in worst-case performance. In this paper, we present DeSparsify, an attack targeting the availability of vision transformers that use token sparsification mechanisms. The attack aims to exhaust the operating system's resources while remaining stealthy. Our evaluation demonstrates the attack's effectiveness on three token sparsification mechanisms, examines its transferability between them, and measures its effect on GPU resources. To mitigate the attack's impact, we propose various countermeasures.
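The quadratic cost and the sparsification idea above can be sketched in a few lines. This is a generic illustration, not any of the paper's three mechanisms: the random importance scores and the fixed keep ratio are placeholder assumptions standing in for a learned, input-dependent scoring function.

```python
import numpy as np

def attention_cost(n_tokens, d_model):
    # The self-attention score matrix is (n_tokens x n_tokens), so the
    # cost of forming it grows quadratically with the number of tokens.
    return n_tokens * n_tokens * d_model

def sparsify(tokens, scores, keep_ratio=0.5):
    # Keep only the highest-scoring ("most informative") tokens; a real
    # mechanism would compute these scores from the input itself.
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]
    return tokens[np.sort(keep)]  # preserve the original token order

tokens = np.random.randn(196, 64)  # e.g., 14x14 image patches, 64-dim embeddings
scores = np.random.rand(196)       # placeholder for learned importance scores
kept = sparsify(tokens, scores, keep_ratio=0.5)

print(attention_cost(196, 64))       # full attention cost
print(attention_cost(len(kept), 64))  # ~4x cheaper after dropping half the tokens
```

An availability attack in this setting would perturb the input so that the importance scores keep every token, forcing the worst-case (full quadratic) cost on every layer.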