In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This noise can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.