Position Embeddings (PEs), an arguably indispensable component of Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs carry a potentially high risk of privacy leakage because they expose the spatial information of the input patches. This caveat naturally raises a series of questions about the impact of PEs on accuracy, privacy, prediction consistency, and related properties. To tackle these issues, we propose a Masked Jigsaw Puzzle (MJP) position embedding method. In particular, MJP first shuffles the selected patches via our block-wise random jigsaw puzzle shuffle algorithm and occludes their corresponding PEs. Meanwhile, for the non-occluded patches, the PEs remain the original ones, but their spatial relation is strengthened via our dense absolute localization regressor. The experimental results reveal that 1) PEs explicitly encode the 2D spatial relationship and lead to severe privacy leakage under gradient inversion attacks; 2) training ViTs with naively shuffled patches can alleviate the problem but harms accuracy; 3) under a certain shuffle ratio, the proposed MJP not only boosts performance and robustness on large-scale datasets (i.e., ImageNet-1K and ImageNet-C, -A/O) but also improves privacy preservation under typical gradient attacks by a large margin. The source code and trained models are available at~\url{https://github.com/yhlleo/MJP}.
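The shuffle-and-occlude step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal illustration under several assumptions that the abstract does not specify: the selected patches form one random rectangular block of the patch grid, the block's area is approximately `ratio` times the grid area, and every occluded PE is replaced by a single shared vector `unk_emb`. The function name `masked_jigsaw_shuffle` and all parameter names are hypothetical.

```python
import numpy as np


def masked_jigsaw_shuffle(patches, pos_emb, unk_emb, grid, ratio, rng):
    """Shuffle a random block of patches and occlude their position embeddings.

    patches: (N, D) patch tokens in row-major grid order.
    pos_emb: (N, D) position embeddings, one per grid cell.
    unk_emb: (D,) shared stand-in for occluded PEs (an assumption of this sketch).
    grid:    (h, w) patch grid with h * w == N.
    ratio:   approximate fraction of patches to shuffle.
    Returns the tokens fed to the encoder and a boolean mask of occluded cells.
    """
    h, w = grid
    n = h * w
    assert patches.shape[0] == n and pos_emb.shape[0] == n
    mask = np.zeros(n, dtype=bool)
    if ratio <= 0:
        # Nothing is shuffled: plain ViT input.
        return patches + pos_emb, mask

    # Assumed block shape: same aspect ratio as the grid, area ~ ratio * n.
    bh = max(1, min(h, int(round(h * np.sqrt(ratio)))))
    bw = max(1, min(w, int(round(w * np.sqrt(ratio)))))
    top = rng.integers(0, h - bh + 1)
    left = rng.integers(0, w - bw + 1)

    # Flat (row-major) indices of the grid cells inside the block.
    rows, cols = np.meshgrid(
        np.arange(top, top + bh), np.arange(left, left + bw), indexing="ij"
    )
    idx = (rows * w + cols).ravel()

    # Jigsaw step: permute the patch contents among the selected cells.
    shuffled = patches.copy()
    shuffled[idx] = patches[rng.permutation(idx)]

    # Occlusion step: selected cells lose their true position embedding.
    masked_pos = pos_emb.copy()
    masked_pos[idx] = unk_emb
    mask[idx] = True
    return shuffled + masked_pos, mask


# Example: a 4x4 patch grid with 8-dimensional tokens.
rng = np.random.default_rng(0)
patches = np.arange(16 * 8, dtype=float).reshape(16, 8)
pos_emb = rng.normal(size=(16, 8))
unk_emb = np.ones(8)
tokens, mask = masked_jigsaw_shuffle(patches, pos_emb, unk_emb, (4, 4), 0.25, rng)
```

Patches outside the block keep both their content and their original PE, which matches the abstract's description of the non-occluded patches. The dense absolute localization regressor is a training-time component and is not covered by this sketch.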