The widespread adoption of large language models (LLMs) across various regions underscores the urgent need to evaluate their alignment with human values. Current benchmarks, however, fall short of effectively uncovering safety vulnerabilities in LLMs. Although numerous models achieve high scores and "top the chart" in these evaluations, a significant gap remains between that performance and deeper alignment with human values and genuine harmlessness. To this end, this paper proposes a value alignment benchmark named Flames, which encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values such as harmony. Accordingly, we carefully design adversarial prompts that incorporate complex scenarios and jailbreaking methods, mostly with implicit malice. By prompting 17 mainstream LLMs, we obtain model responses and rigorously annotate them for detailed evaluation. Our findings indicate that all the evaluated LLMs perform relatively poorly on Flames, particularly in the safety and fairness dimensions. We also develop a lightweight specified scorer that scores LLMs across multiple dimensions, so that new models can be evaluated on the benchmark efficiently. Flames is far more complex than existing benchmarks, setting a new challenge for contemporary LLMs and highlighting the need for further alignment of LLMs. Our benchmark is publicly available at https://github.com/AIFlames/Flames.