MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.

翻译：后门攻击对大型语言模型（LLMs）构成了严重的安全威胁，这些模型正越来越多地被部署为安全与隐私关键应用中的通用助手。现有的LLM后门主要依赖基于内容的触发器，要求对输入文本进行显式修改。本工作表明，这一假设既非必要也存在局限性。我们提出MetaBackdoor，一种新型后门攻击方法，它利用位置信息作为触发器，无需修改文本内容。我们的关键洞察在于：基于Transformer的LLM必须编码词元位置以处理有序序列。因此，与长度相关的位置结构会反映在模型的内部计算中，可作为有效的非内容触发器信号。我们证明，即使基于长度的简单位置触发器也足以激活隐蔽后门。与先前攻击不同，MetaBackdoor作用于视觉和语义均干净的输入，并支持前所未有的新能力。我们展示，一旦满足长度条件，植入后门的LLM可能被诱导泄露敏感内部信息，包括专有系统提示。更进一步，我们演示了一种自激活场景：正常的多轮交互可将对话上下文移入触发器区域，在无攻击者提供的触发文本情况下诱发恶意工具调用行为。此外，MetaBackdoor与基于内容的后门正交，并可与之组合以创建更精确、更难检测的激活条件。我们的研究成果通过揭示位置编码这一此前被忽视的攻击面，扩展了LLM后门的威胁模型。这对专注于检测可疑文本的防御措施提出挑战，并凸显了为现代LLM架构制定明确考虑位置触发器的全新防御策略的必要性。