Large Language Models (LLMs) are increasingly used as machine learning services and interface tools for various applications. However, the security implications of LLMs, particularly with respect to adversarial and Trojan attacks, remain insufficiently examined. In this paper, we propose TrojLLM, an automatic, black-box framework that effectively generates universal and stealthy triggers. When these triggers are incorporated into the input data, the LLMs' outputs can be maliciously manipulated. The framework also supports embedding Trojans within discrete prompts, which enhances the overall effectiveness and precision of the trigger-based attacks. Specifically, we propose a trigger discovery algorithm that generates universal triggers for various inputs by querying victim LLM-based APIs with few-shot data samples. Furthermore, we introduce a novel progressive Trojan poisoning algorithm designed to generate poisoned prompts that retain efficacy and transferability across a diverse range of models. Our experiments demonstrate that TrojLLM can effectively insert Trojans into text prompts in real-world black-box LLM APIs, including GPT-3.5 and GPT-4, while maintaining exceptional performance on clean test sets. Our work sheds light on the potential security risks in current models and offers a potential defensive approach. The source code of TrojLLM is available at https://github.com/UCF-ML-Research/TrojLLM.