Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither, which utilizes the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluated this attack vector by constructing template backdoors targeting two objectives: degrading factual accuracy and inducing emission of attacker-controlled URLs, and applied them across eighteen models spanning seven families and four inference engines. Under triggered conditions, factual accuracy drops from 90% to 15% on average while attacker-controlled URLs are emitted with success rates exceeding 80%; benign inputs show no measurable degradation. Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform. These results establish chat templates as a reliable and currently undefended attack surface in the LLM supply chain.

翻译：开源权重语言模型在生产环境中的应用日益广泛，这带来了新的安全挑战。在此背景下，后门攻击成为一个突出的威胁，即攻击者在语言模型中嵌入隐藏行为，这些行为在特定条件下会被激活。先前的研究通常假设攻击者能够访问训练流程或部署基础设施。我们提出了一种新型攻击面，它既不需要访问训练流程，也不需要控制部署基础设施，而是利用聊天模板实现攻击。聊天模板是在每次推理调用时执行的可执行Jinja2程序，占据用户输入与模型处理之间的特权位置。我们证明，攻击者通过分发带有恶意修改模板的模型，可以在不修改模型权重、不污染训练数据且不控制运行时基础设施的情况下，植入推理时后门。我们通过构建针对两个目标的模板后门来评估此攻击向量：降低事实准确性以及诱导模型输出攻击者控制的URL，并将其应用于涵盖七个模型系列、四个推理引擎的十八个模型。在触发条件下，事实准确性平均从90%下降至15%，而攻击者控制的URL输出成功率超过80%；良性输入则未出现可测量的性能下降。这些后门在不同推理运行时中具有泛化能力，并能规避最大开源权重分发平台所应用的所有自动化安全扫描。这些结果表明，聊天模板是大型语言模型供应链中一个可靠且当前缺乏有效防御的攻击面。