Large language models (LLMs), known for their ability to understand and follow instructions, are vulnerable to adversarial attacks. Researchers have found that, when faced with adversarial queries, current commercial LLMs either fail to be "harmless" by presenting unethical answers or fail to be "helpful" by refusing to offer meaningful answers. To strike a balance between being helpful and harmless, we design a moving target defense (MTD) enhanced LLM system. The system aims to deliver non-toxic answers that align with outputs from multiple model candidates, making its responses more robust against adversarial attacks. We design a query and output analysis model to filter out unsafe or non-responsive answers. We evaluate 8 of the most recent chatbot models against state-of-the-art adversarial queries. Our MTD-enhanced LLM system reduces the attack success rate from 37.5\% to 0\% and the response refusal rate from 50\% to 0\%.