The rapid growth of LLM safety research makes it hard to track every advance. Benchmarks are therefore crucial for capturing key trends and enabling systematic comparisons. Yet it remains unclear why certain benchmarks gain prominence, and no systematic assessment of their academic influence or code quality has been conducted. This paper fills that gap by presenting the first multi-dimensional evaluation of the influence (based on five metrics) and code quality (based on both automated and human assessment) of LLM safety benchmarks, analyzing 31 benchmarks and 382 non-benchmark papers across prompt injection, jailbreak, and hallucination. We find that benchmark papers show no significant advantage in academic influence (e.g., citation count and density) over non-benchmark papers. We also uncover a key misalignment: while author prominence correlates with paper influence, neither author prominence nor paper influence correlates significantly with code quality. Our results further indicate substantial room for improvement in code and supplementary materials: only 39% of repositories are ready to use, 16% include flawless installation guides, and a mere 6% address ethical considerations. Because the work of prominent researchers tends to attract greater attention, they should lead the effort to set higher standards.