Chat template is a common technique used in the training and inference stages of Large Language Models (LLMs). It can transform input and output data into role-based and templated expressions to enhance the performance of LLMs. However, this also creates a breeding ground for novel attack surfaces. In this paper, we first reveal that the customizability of chat templates allows an attacker who controls the template to inject arbitrary strings into the system prompt without the user's notice. Building on this, we propose a training-free backdoor attack, termed BadTemplate. Specifically, BadTemplate inserts carefully crafted malicious instructions into the high-priority system prompt, thereby causing the target LLM to exhibit persistent backdoor behaviors. BadTemplate outperforms traditional backdoor attacks by embedding malicious instructions directly into the system prompt, eliminating the need for model retraining while achieving high attack effectiveness with minimal cost. Furthermore, its simplicity and scalability make it easily and widely deployed in real-world systems, raising serious risks of rapid propagation, economic damage, and large-scale misinformation. Furthermore, detection by major third-party platforms HuggingFace and LLM-as-a-judge proves largely ineffective against BadTemplate. Extensive experiments conducted on 5 benchmark datasets across 6 open-source and 3 closed-source LLMs, compared with 3 baselines, demonstrate that BadTemplate achieves up to a 100% attack success rate and significantly outperforms traditional prompt-based backdoors in both word-level and sentence-level attacks. Our work highlights the potential security risks raised by chat templates in the LLM supply chain, thereby supporting the development of effective defense mechanisms.
翻译:聊天模板是大语言模型训练与推理阶段常用的技术手段,能够将输入输出数据转化为基于角色的模板化表达以提升模型性能。然而,这也为新型攻击面的滋生创造了条件。本文首次揭示:聊天模板的可定制性使得控制模板的攻击者能够在用户无感知的情况下向系统提示词注入任意字符串。基于此,我们提出一种免训练后门攻击方法,命名为BadTemplate。具体而言,BadTemplate将精心构造的恶意指令插入高优先级的系统提示词中,从而使目标大语言模型表现出持续性的后门行为。BadTemplate通过将恶意指令直接嵌入系统提示词,无需模型重训练即能以极低成本实现高攻击效能,优于传统后门攻击方法。此外,其简易性与可扩展性使得该方法易于在实际系统中广泛部署,带来快速传播、经济损失与大规模虚假信息等严重风险。进一步地,主流第三方平台HuggingFace与基于大语言模型的评判器在检测BadTemplate时均被证明基本无效。我们在6个开源与3个闭源大语言模型上,基于5个基准数据集与3个基线方法进行的广泛实验表明:BadTemplate在词级与句级攻击中均能达到最高100%的攻击成功率,并显著优于传统的基于提示词的后门方法。本研究揭示了聊天模板在大语言模型供应链中引发的潜在安全风险,从而为开发有效防御机制提供支持。