Large Language Models (LLMs) are progressively being utilized as machine learning services and interface tools for various applications. However, the security implications of LLMs, particularly in relation to adversarial and Trojan attacks, remain insufficiently examined. In this paper, we propose TrojLLM, an automatic and black-box framework to effectively generate universal and stealthy triggers. When these triggers are incorporated into the input data, the LLMs' outputs can be maliciously manipulated. Moreover, the framework also supports embedding Trojans within discrete prompts, enhancing the overall effectiveness and precision of the triggers' attacks. Specifically, we propose a trigger discovery algorithm for generating universal triggers for various inputs by querying victim LLM-based APIs using few-shot data samples. Furthermore, we introduce a novel progressive Trojan poisoning algorithm designed to generate poisoned prompts that retain efficacy and transferability across a diverse range of models. Our experiments and results demonstrate TrojLLM's capacity to effectively insert Trojans into text prompts in real-world black-box LLM APIs including GPT-3.5 and GPT-4, while maintaining exceptional performance on clean test sets. Our work sheds light on the potential security risks in current models and offers a potential defensive approach. The source code of TrojLLM is available at https://github.com/UCF-ML-Research/TrojLLM.
翻译:大型语言模型(LLMs)正逐步被用作各类应用的机器学习服务及接口工具。然而,LLMs在安全层面的影响,尤其是在对抗攻击和木马攻击方面的研究仍不充分。本文提出TrojLLM,一种自动化的黑盒框架,可有效生成通用且隐蔽的触发器。当这些触发器被嵌入输入数据时,LLMs的输出将遭到恶意操控。此外,该框架还支持将木马植入离散提示中,从而提升触发器攻击的整体效果和精确度。具体而言,我们提出一种触发器发现算法,通过利用少量数据样本查询基于受害者LLM的API,生成适用于各类输入的通用触发器。进一步地,我们引入一种新颖的渐进式木马投毒算法,旨在生成能够在多种模型中保持有效性和可迁移性的投毒提示。实验结果表明,TrojLLM能够在包括GPT-3.5和GPT-4在内的真实黑盒LLM API中有效将木马植入文本提示,同时在干净测试集上保持优异性能。本研究揭示了当前模型中的潜在安全风险,并提供了一种可能的防御思路。TrojLLM的源代码可通过https://github.com/UCF-ML-Research/TrojLLM获取。