Transformers have demonstrated remarkable in-context learning capabilities across various domains, including statistical learning tasks. While previous work has shown that transformers can implement common learning algorithms, the adversarial robustness of these learned algorithms remains unexplored. This work investigates the vulnerability of in-context learning in transformers to \textit{hijacking attacks}, focusing on the setting of linear regression tasks. Hijacking attacks are prompt-manipulation attacks in which the adversary's goal is to manipulate the prompt to force the transformer to generate a specific output. We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust: they can be manipulated to output arbitrary predictions by perturbing a single example in the in-context training set. While our experiments show that these attacks succeed on linear transformers, we find that they do not transfer to more complex transformers with GPT-2 architectures. Nonetheless, we show that these transformers can be hijacked using gradient-based adversarial attacks. We then demonstrate that adversarial training enhances transformers' robustness against hijacking attacks, even when applied only during finetuning. Additionally, we find that in some settings, adversarial training against a weaker attack model can confer robustness to a stronger attack model. Lastly, we investigate the transferability of hijacking attacks across transformers of varying scales and initialization seeds, as well as between transformers and ordinary least squares (OLS). We find that while attacks transfer effectively between small-scale transformers, they show poor transferability in other scenarios (small-to-large scale, large-to-large scale, and between transformers and OLS).
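The single-example hijack on a single-layer linear transformer can be illustrated numerically. This is a hedged sketch, not the paper's exact construction: it assumes the known result that such a transformer's in-context prediction equals one step of gradient descent from zero, here with an identity preconditioner and step size $1/n$, so $f(x_q) = \frac{1}{n}\sum_i y_i \langle x_i, x_q\rangle$. Since $f$ is linear in each context label, perturbing a single label suffices to force any target output.

```python
import numpy as np

# Sketch of a hijacking attack on a one-step-GD in-context learner.
# Assumption (from prior work cited in the abstract): a single-layer linear
# transformer's prediction on query x_q given context {(x_i, y_i)} is
#     f(x_q) = (1/n) * sum_i y_i * <x_i, x_q>.

rng = np.random.default_rng(0)
n, d = 10, 5
X = rng.standard_normal((n, d))        # in-context inputs
w_star = rng.standard_normal(d)        # ground-truth regression weights
y = X @ w_star                         # clean in-context labels
x_q = rng.standard_normal(d)           # query input

def one_step_gd_pred(X, y, x_q):
    """Prediction of the one-step-GD in-context learner."""
    return (y * (X @ x_q)).sum() / len(y)

target = 42.0  # arbitrary output the adversary wants to force
j = 0          # index of the single context example to perturb

# f is linear in y_j, so solving f(x_q) = target for the new label gives
# (valid whenever <x_j, x_q> != 0, which holds almost surely here):
y_adv = y.copy()
y_adv[j] += n * (target - one_step_gd_pred(X, y, x_q)) / (X[j] @ x_q)

print(one_step_gd_pred(X, y_adv, x_q))  # forced to the target, ~42.0
```

Perturbing the input $x_j$ instead of the label works analogously; the abstract's stronger attacks on GPT-2-style transformers replace this closed-form step with gradient-based optimization of the perturbation.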