Growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in safety evaluation. This study investigates a notable issue in the evaluation of LLMs: the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue that this discrepancy is caused by mismatched generalization. That is, the LLM lacks a comprehensive understanding of the complex concept of safety; it only remembers what to answer for open-ended safety questions, which leaves it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. Fake alignment renders previous evaluation protocols unreliable. To address this, we introduce the FAEF framework and two novel metrics, the Consistency Score (CS) and the Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. Applying FAEF to 14 widely used LLMs reveals that several models purported to be safe are poorly aligned in practice. Our work highlights potential limitations in prevailing alignment methodologies.
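To make the idea of jointly scoring the two evaluation forms concrete, here is a minimal sketch. The abstract does not define CS or CSS, so the definitions below are assumptions made for illustration: CS is taken to be the fraction of test items on which the open-ended and multiple-choice judgments agree, and CSS the fraction on which the model is judged safe in both forms. The `ItemResult` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ItemResult:
    """Hypothetical per-item record: was the model judged safe in each form?"""
    open_ended_safe: bool
    multiple_choice_safe: bool


def consistency_score(results: Sequence[ItemResult]) -> float:
    """Assumed CS: share of items where both evaluation forms agree."""
    if not results:
        raise ValueError("results must be non-empty")
    agree = sum(r.open_ended_safe == r.multiple_choice_safe for r in results)
    return agree / len(results)


def consistent_safety_score(results: Sequence[ItemResult]) -> float:
    """Assumed CSS: share of items judged safe in both evaluation forms."""
    if not results:
        raise ValueError("results must be non-empty")
    both_safe = sum(r.open_ended_safe and r.multiple_choice_safe for r in results)
    return both_safe / len(results)


# Toy data, invented for illustration only (not results from the paper).
toy_results = [
    ItemResult(open_ended_safe=True, multiple_choice_safe=True),
    ItemResult(open_ended_safe=True, multiple_choice_safe=False),
    ItemResult(open_ended_safe=False, multiple_choice_safe=False),
    ItemResult(open_ended_safe=True, multiple_choice_safe=False),
]

open_ended_rate = sum(r.open_ended_safe for r in toy_results) / len(toy_results)
print(open_ended_rate)                       # open-ended safety rate alone
print(consistency_score(toy_results))        # agreement between the two forms
print(consistent_safety_score(toy_results))  # safe in both forms
```

On this toy data, the open-ended safety rate alone is higher than the assumed CSS, which is the kind of gap the abstract attributes to fake alignment.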