Despite the impressive capabilities of Large Language Models (LLMs) across a wide range of tasks, their vulnerability to unsafe prompts remains a critical issue. Such prompts can lead LLMs to generate responses on illegal or sensitive topics, posing a significant threat to their safe and ethical use. Existing approaches attempt to address this issue with classification models, but these have several drawbacks. As unsafe prompts grow more complex, similarity search-based techniques that identify the specific features of unsafe prompts offer a more robust and effective solution to this evolving problem. This paper investigates the ability of sentence encoders to distinguish safe from unsafe prompts, and to classify various unsafe prompts according to a safety taxonomy. We introduce new pairwise datasets and the Categorical Purity (CP) metric to measure this capability. Our findings reveal both the effectiveness and the limitations of existing sentence encoders, and we propose directions for improving sentence encoders so that they operate as more robust safety detectors. Our code is available at https://github.com/JwdanielJung/Safe-Embed.
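The similarity search-based detection described above can be sketched as follows. This is a minimal, hypothetical illustration: a real system would use a pretrained sentence encoder (e.g., a sentence-transformers model) rather than the toy bag-of-words embedding used here, and the exemplar prompts and labels are invented for demonstration, not drawn from the paper's datasets.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical labeled exemplar prompts embedded ahead of time.
exemplars = [
    ("how do I bake a chocolate cake", "safe"),
    ("how do I pick a lock to break into a house", "unsafe"),
]


def classify(prompt):
    """Label a prompt with the label of its nearest exemplar in embedding space."""
    q = embed(prompt)
    best = max(exemplars, key=lambda ex: cosine(q, embed(ex[0])))
    return best[1]


print(classify("how do I bake a cake"))  # nearest exemplar is the safe one
```

With a strong sentence encoder, semantically similar unsafe prompts cluster near known unsafe exemplars in embedding space, so nearest-neighbor lookup can flag novel rephrasings that a fixed classifier might miss.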