In recent years, large language models (LLMs) have become pivotal tools in a wide range of applications. However, these models are susceptible to adversarial prompt attacks, in which an attacker carefully crafts input strings that lead to undesirable outputs. This vulnerability stems from the models' input-output mechanisms, especially when they are presented with highly out-of-distribution (OOD) inputs. This paper proposes a token-level detection method for identifying adversarial prompts, leveraging the LLM's ability to predict next-token probabilities. We measure the model's perplexity at each token and incorporate information from neighboring tokens to encourage the detection of contiguous adversarial sequences. This yields two methods: one that classifies each token as either part of an adversarial prompt or not, and another that estimates the probability that each token is part of an adversarial prompt.
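To make the two outputs concrete, the following is a minimal sketch of one way to combine per-token perplexity with neighboring-token information; it is not necessarily the formulation used in the paper. It assumes the per-token surprisal values, -log p(x_i | x_<i), have already been obtained from an LLM. The names `hard_labels` and `soft_probs`, the flagging cost `mu`, and the label-switch penalty `lam` are all hypothetical choices made for illustration. The hard version picks the labeling with the lowest total cost using dynamic programming, and the soft version computes per-token marginals of the same two-state chain model with the forward-backward algorithm.

```python
import math


def hard_labels(surprisal, mu=5.0, lam=2.0):
    """Return 0/1 labels (1 = flagged as adversarial) minimizing total cost.

    Assumed cost model (illustrative only): labeling token i as normal costs
    surprisal[i], labeling it as adversarial costs mu, and every change of
    label between adjacent tokens costs lam.
    """
    n = len(surprisal)
    if n == 0:
        return []
    cost = [[surprisal[0], mu]]  # cost[i][c]: best cost of a prefix ending in label c
    back = []                    # back[i-1][c]: best previous label for label c at i
    for i in range(1, n):
        unary = (surprisal[i], mu)
        row, ptr = [], []
        for c in (0, 1):
            stay = cost[-1][c]
            switch = cost[-1][1 - c] + lam
            prev = c if stay <= switch else 1 - c
            row.append(min(stay, switch) + unary[c])
            ptr.append(prev)
        cost.append(row)
        back.append(ptr)
    label = 0 if cost[-1][0] <= cost[-1][1] else 1
    labels = [label]
    for ptr in reversed(back):  # trace the best path backwards
        label = ptr[label]
        labels.append(label)
    return labels[::-1]


def _norm(pair):
    total = pair[0] + pair[1]
    return (pair[0] / total, pair[1] / total)


def soft_probs(surprisal, mu=5.0, lam=2.0):
    """Return P(token i is adversarial) under the same assumed cost model,
    treating exp(-cost) as unnormalized probability (forward-backward)."""
    n = len(surprisal)
    if n == 0:
        return []
    phi = [(math.exp(-s), math.exp(-mu)) for s in surprisal]
    t = math.exp(-lam)  # relative weight of switching labels
    fwd = [_norm(phi[0])]
    for i in range(1, n):
        a0, a1 = fwd[-1]
        fwd.append(_norm(((a0 + a1 * t) * phi[i][0], (a0 * t + a1) * phi[i][1])))
    bwd = [(1.0, 1.0)]
    for i in range(n - 1, 0, -1):
        b0, b1 = bwd[-1]
        m0, m1 = b0 * phi[i][0], b1 * phi[i][1]
        bwd.append(_norm((m0 + m1 * t, m0 * t + m1)))
    bwd.reverse()
    return [f1 * b1 / (f0 * b0 + f1 * b1) for (f0, f1), (b0, b1) in zip(fwd, bwd)]


# Made-up surprisal values; token 4 is only mildly surprising but sits
# between highly surprising neighbors.
example = [1.0, 0.5, 2.0, 9.0, 3.0, 10.0, 11.0, 1.0]
print(hard_labels(example))  # [0, 0, 0, 1, 1, 1, 1, 0]
print([round(p, 3) for p in soft_probs(example)])
```

In this toy input, thresholding each token independently at `mu` would leave token 4 unflagged; the switch penalty is what pulls it into the contiguous flagged span. Larger values of `lam` favor longer, fewer spans, while `lam = 0` reduces both functions to independent per-token decisions.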