Large Language Models (LLMs) have shown remarkable capabilities in following human instructions. However, recent studies have raised concerns about the robustness of LLMs when their instructions are combined with textual adversarial samples. In this paper, drawing inspiration from recent work showing that LLMs are sensitive to how instructions are designed, we replace typical natural language instructions with code-style instructions, which are more structured and less ambiguous. This conversion gives LLMs more precise instructions and strengthens their robustness. Moreover, in few-shot scenarios, we propose a novel method that composes in-context demonstrations from both clean and adversarial samples (\textit{adversarial context method}) to further boost the robustness of LLMs. Experiments on eight robustness datasets show that our method consistently outperforms prompting LLMs with natural language instructions. For example, with gpt-3.5-turbo, our method improves test set accuracy by 5.68\% and reduces the Attack Success Rate (ASR) by 5.66 points.
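The two ideas above, code-style instructions and an adversarial context built from clean and adversarial demonstrations, can be illustrated with a minimal Python sketch. It is not the paper's actual prompt format. The function-stub template, the pairwise interleaving of each clean demonstration with an adversarial variant that keeps the same gold label, and the ASR definition are all assumptions made for illustration. All names (`code_style_instruction`, `adversarial_context_prompt`, `attack_success_rate`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    text: str
    label: str


def code_style_instruction(task: str, labels: list[str]) -> str:
    """Render the task as a Python-like function stub (hypothetical template)."""
    options = ", ".join(repr(label) for label in labels)
    return (
        f"def {task}(sentence: str) -> str:\n"
        f'    """Classify `sentence`; return exactly one of [{options}]."""\n'
    )


def render_call(task: str, ex: Example, with_answer: bool = True) -> str:
    """Render one example as a function call, optionally followed by its answer."""
    call = f"{task}({ex.text!r})"
    return f"{call}  # -> {ex.label!r}" if with_answer else f"{call}  # ->"


def adversarial_context_prompt(task: str, labels: list[str], clean: list[Example],
                               adversarial: list[Example], query: str) -> str:
    """Build a few-shot prompt that mixes clean and adversarial demonstrations.

    Assumption: adversarial[i] is a perturbed copy of clean[i] with the same
    gold label, and the two are interleaved pairwise before the query.
    """
    lines = [code_style_instruction(task, labels)]
    for c, a in zip(clean, adversarial):
        lines.append(render_call(task, c))
        lines.append(render_call(task, a))
    lines.append(render_call(task, Example(query, ""), with_answer=False))
    return "\n".join(lines)


def attack_success_rate(clean_preds: list[str], adv_preds: list[str],
                        gold: list[str]) -> float:
    """One common ASR definition (assumed): among samples predicted correctly
    before the attack, the percentage predicted incorrectly after it."""
    correct = [i for i, g in enumerate(gold) if clean_preds[i] == g]
    if not correct:
        return 0.0
    broken = sum(1 for i in correct if adv_preds[i] != gold[i])
    return 100.0 * broken / len(correct)


if __name__ == "__main__":
    clean = [Example("the film was a delight", "positive")]
    adv = [Example("the film was a dleight", "positive")]  # character-level typo attack
    print(adversarial_context_prompt("sentiment", ["positive", "negative"],
                                     clean, adv, "a dull, lifeless plot"))
    # Both clean predictions are correct; one flips under attack.
    print(attack_success_rate(["positive", "negative"],
                              ["negative", "negative"],
                              ["positive", "negative"]))  # 50.0
```

Pairing each adversarial demonstration with its clean source is one plausible way to expose the model to perturbed inputs while keeping the label signal consistent; other compositions (for example, sampling adversarial demonstrations independently) would fit the same description.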