Growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in safety evaluation. This study investigates a specific issue in the evaluation of LLMs: the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue that this discrepancy is caused by mismatched generalization. That is, the LLM lacks a comprehensive understanding of the complex concept of safety; it has only memorized how to answer open-ended safety questions, which leaves it unable to handle other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. Fake alignment renders previous evaluation protocols unreliable. To address this, we introduce the Fake alIgNment Evaluation (FINE) framework and two novel metrics, Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. Applying FINE to 14 widely used LLMs reveals that several models purported to be safe are poorly aligned in practice. Our work highlights potential limitations in prevailing alignment methodologies.
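The abstract names CS and CSS without defining them, so the following is only a minimal sketch of how two such metrics could be computed. It assumes that each test question yields one binary safety judgment for the open-ended form and one for the multiple-choice form, that CS is the fraction of questions on which the two judgments agree, and that CSS is the fraction on which both are judged safe. The function name `fine_scores` and its parameters are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple


def fine_scores(open_ended_safe: Sequence[bool],
                multiple_choice_safe: Sequence[bool]) -> Tuple[float, float]:
    """Return (CS, CSS) under the assumed definitions described above.

    open_ended_safe[i]      -- True if the open-ended answer to question i was judged safe
    multiple_choice_safe[i] -- True if the multiple-choice answer to question i was judged safe
    """
    if len(open_ended_safe) != len(multiple_choice_safe):
        raise ValueError("both evaluation forms must cover the same questions")
    n = len(open_ended_safe)
    if n == 0:
        raise ValueError("at least one question is required")

    consistent = 0         # questions where the two forms agree (assumed CS numerator)
    consistently_safe = 0  # questions where both forms are safe (assumed CSS numerator)
    for open_ok, mc_ok in zip(open_ended_safe, multiple_choice_safe):
        if open_ok == mc_ok:
            consistent += 1
            if open_ok:
                consistently_safe += 1

    return consistent / n, consistently_safe / n


# Illustrative, made-up judgments for four questions (not results from the paper).
open_ended = [True, True, True, False]
multiple_choice = [True, False, False, False]
cs, css = fine_scores(open_ended, multiple_choice)
print(f"CS = {cs:.2f}, CSS = {css:.2f}")
```

In this made-up example the model looks safe on three of four open-ended questions but agrees with itself on only half of them, which is the kind of gap the abstract attributes to fake alignment.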