The increasing use of large language models (LLMs) trained by third parties raises significant security concerns. In particular, malicious actors can introduce backdoors through poisoning attacks to generate undesirable outputs. While such attacks have been extensively studied in image domains and for classification tasks, they remain underexplored for natural language generation (NLG) tasks. To address this gap, we conduct an investigation of various poisoning techniques that target the LLM fine-tuning phase via prefix-tuning, a Parameter-Efficient Fine-Tuning (PEFT) method. We assess their effectiveness across two generative tasks, text summarization and text completion, and we introduce new metrics to quantify the success and stealthiness of such NLG poisoning attacks. Through our experiments, we find that prefix-tuning hyperparameters and trigger design are the most crucial factors influencing attack success and stealthiness. Moreover, we demonstrate that existing popular defenses are ineffective against our poisoning attacks. Our study presents the first systematic approach to understanding poisoning attacks targeting NLG tasks during fine-tuning via PEFT across a wide range of triggers and attack settings. We hope our findings will aid the AI security community in developing effective defenses against such threats.