Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither, which utilizes the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluated this attack vector by constructing template backdoors targeting two objectives: degrading factual accuracy and inducing emission of attacker-controlled URLs, and applied them across eighteen models spanning seven families and four inference engines. Under triggered conditions, factual accuracy drops from 90% to 15% on average while attacker-controlled URLs are emitted with success rates exceeding 80%; benign inputs show no measurable degradation. Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform. These results establish chat templates as a reliable and currently undefended attack surface in the LLM supply chain.
翻译:开源权重语言模型在生产环境中的应用日益广泛,这带来了新的安全挑战。在此背景下,后门攻击成为一个突出的威胁,即攻击者在语言模型中嵌入特定条件下激活的隐藏行为。先前的研究通常假设攻击者能够访问训练流程或部署基础设施。我们提出了一种无需上述条件的新型攻击面,该攻击面利用聊天模板实现。聊天模板是在每次推理调用时执行的可运行Jinja2程序,占据用户输入与模型处理之间的特权位置。我们证明,攻击者通过分发带有恶意修改模板的模型,无需修改模型权重、污染训练数据或控制运行时基础设施,即可植入推理时后门。我们通过构建针对两个目标的模板后门评估了此攻击向量:降低事实准确性及诱导模型输出攻击者控制的URL,并将其应用于涵盖七个模型系列、四个推理引擎的十八个模型。在触发条件下,事实准确性平均从90%下降至15%,攻击者控制的URL输出成功率超过80%;良性输入则未出现可测量的性能下降。此类后门能够跨推理运行时泛化,并规避了最大开源权重分发平台采用的所有自动化安全扫描。这些结果表明,聊天模板已成为LLM供应链中一个可靠且当前缺乏防御的攻击面。