Scaling Laws for Adversarial Attacks on Language Model Activations

We explore a class of adversarial attacks targeting the activations of language models. By manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. We empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ that can be predicted depends linearly on the number of tokens $a$ whose activations the attacker controls, as $t_\mathrm{max} = \kappa a$. We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant, between $\approx 16$ and $\approx 25$, over 2 orders of magnitude of model sizes for different language models. Compared to attacks on tokens, attacks on activations are predictably much stronger; however, we identify a surprising regularity where one bit of input, steered either via activations or via tokens, is able to exert control over a similar number of output bits. This gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. A practical implication of the ease of attacking language model activations instead of tokens concerns multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. This opens up a new, broad attack surface. By using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
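To make the two quantities above concrete, the following is a minimal sketch of how $\kappa$ and $\chi$ could be estimated from measurements. It rests on assumptions that the abstract does not state: that each attacked token contributes `d_model * bits_per_dim` bits of input control (activation dimension times numerical precision), and that each predicted token counts as `log2(vocab_size)` bits of output. The function names and all numbers are hypothetical placeholders, not results from the paper.

```python
import math


def fit_kappa(a_values, t_max_values):
    """Least-squares slope through the origin for the law t_max = kappa * a."""
    numerator = sum(a * t for a, t in zip(a_values, t_max_values))
    denominator = sum(a * a for a in a_values)
    return numerator / denominator


def attack_resistance(kappa, d_model, bits_per_dim, vocab_size):
    """Input bits controlled per output bit controlled.

    ASSUMPTION: one attacked token gives d_model * bits_per_dim input bits,
    and one target token is worth log2(vocab_size) output bits.
    """
    input_bits_per_attacked_token = d_model * bits_per_dim
    output_bits_per_target_token = math.log2(vocab_size)
    return input_bits_per_attacked_token / (kappa * output_bits_per_target_token)


# Synthetic placeholder data lying exactly on t_max = 3 * a (not measured values).
a_values = [1, 2, 4, 8]
t_max_values = [3.0 * a for a in a_values]

kappa = fit_kappa(a_values, t_max_values)
chi = attack_resistance(kappa, d_model=512, bits_per_dim=16, vocab_size=2**16)
print(f"kappa = {kappa:.2f}, chi = {chi:.2f}")
```

Fitting the slope through the origin mirrors the stated form $t_\mathrm{max} = \kappa a$, which has no intercept; a fit with a free intercept would be a different model.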
