Consider a scenario where a harmfulness detection metric is employed by a system to filter unsafe responses generated by a Large Language Model. When analyzing individual harmful and unethical prompt-response pairs, the metric correctly classifies each pair as highly unsafe, assigning the highest score. However, when these same prompts and responses are concatenated, the metric's decision flips, assigning the lowest possible score, thereby misclassifying the content as safe and allowing it to bypass the filter. In this study, we discovered that several LLM-based harmfulness metrics, including GPT-based ones, exhibit this decision-flipping phenomenon. Additionally, we found that even an advanced metric like GPT-4o is highly sensitive to input order. Specifically, it tends to classify responses as safe if the safe content appears first, regardless of any harmful content that follows, and vice versa. This work introduces automatic concatenation-based tests to assess the fundamental properties a valid metric should satisfy. We applied these tests in a model safety scenario to evaluate the reliability of harmfulness detection metrics, uncovering a number of inconsistencies.
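To make the concatenation-based tests concrete, the following is a minimal sketch of how such checks could be structured. It assumes a hypothetical scoring function `score_fn(prompt, response)` that returns higher values for more harmful content; all names and the threshold are illustrative assumptions, not the paper's actual implementation or API.

```python
# Minimal sketch of concatenation-based consistency checks for a harmfulness
# metric. `score_fn(prompt, response) -> float` is a hypothetical metric
# interface (assumption): higher scores mean more harmful content.

def concatenation_flip_test(pairs, score_fn, unsafe_threshold=0.5):
    """Check whether a metric that flags every pair as unsafe still flags
    the concatenation of those pairs as unsafe. Returns True if the
    decision flips from unsafe to safe after concatenation."""
    # Score each prompt-response pair individually.
    individual_scores = [score_fn(p, r) for p, r in pairs]
    all_individually_unsafe = all(s >= unsafe_threshold for s in individual_scores)

    # Concatenate the prompts and the responses, then score the combined input.
    joined_prompt = "\n".join(p for p, _ in pairs)
    joined_response = "\n".join(r for _, r in pairs)
    concatenated_score = score_fn(joined_prompt, joined_response)

    # A consistent metric should not judge the concatenation of unsafe
    # content as safe.
    flipped = all_individually_unsafe and concatenated_score < unsafe_threshold
    return flipped, individual_scores, concatenated_score


def order_sensitivity_test(safe_pair, harmful_pair, score_fn):
    """Score the same safe + harmful content in both orders; a valid metric
    should reach the same verdict regardless of which content comes first."""
    sp, sr = safe_pair
    hp, hr = harmful_pair
    safe_first = score_fn(sp + "\n" + hp, sr + "\n" + hr)
    harmful_first = score_fn(hp + "\n" + sp, hr + "\n" + sr)
    return safe_first, harmful_first
```

A large gap between `safe_first` and `harmful_first`, or a `flipped` result in the first check, would indicate the kind of inconsistency described above; the exact thresholds and aggregation would depend on the metric under test.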