LLM benchmarking metrics often misstate performance and uncertainty because they rely on two assumptions that frequently fail in practice: (i) enough evaluations are available for classical inference, and (ii) test prompts are independent. We propose a corrective Bayesian hierarchical model with embedding-space clustering that yields robust performance metrics in limited-data settings while correcting for dependence between prompts. Applied to adversarial robustness benchmarks, the approach consistently recovers the clustering structure and produces more reliable performance metrics, improving mean absolute error by 4-73% and expected log posterior density by 40-450 units.
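To illustrate why assumption (ii) matters, the following minimal sketch compares a naive binomial standard error, which treats every prompt as independent, with a standard error computed from per-cluster pass rates. It is a simple frequentist illustration under stated assumptions, not the Bayesian hierarchical model proposed here: the function names, the data, and the cluster labels are hypothetical, and the labels are assumed to be given rather than derived from embeddings.

```python
import math
from collections import defaultdict


def naive_se(outcomes):
    """Binomial standard error that treats every prompt as independent."""
    n = len(outcomes)
    p = sum(outcomes) / n
    return math.sqrt(p * (1 - p) / n)


def cluster_se(outcomes, clusters):
    """Standard error of the mean of per-cluster pass rates.

    Treats clusters, not prompts, as the independent units.
    Assumes equal-sized clusters and at least two clusters.
    """
    by_cluster = defaultdict(list)
    for y, c in zip(outcomes, clusters):
        by_cluster[c].append(y)
    means = [sum(v) / len(v) for v in by_cluster.values()]
    k = len(means)
    grand = sum(means) / k
    # Sample variance of the cluster-level pass rates.
    var = sum((m - grand) ** 2 for m in means) / (k - 1)
    return math.sqrt(var / k)


# Hypothetical data: 4 clusters of 5 near-duplicate prompts each,
# where the model passes or fails a whole cluster together.
outcomes = [1] * 5 + [1] * 5 + [0] * 5 + [1] * 5
clusters = [c for c in range(4) for _ in range(5)]

print(round(naive_se(outcomes), 3))              # 0.097
print(round(cluster_se(outcomes, clusters), 3))  # 0.25
```

In this toy case the 20 prompts carry roughly the information of 4 independent evaluations, so the naive standard error understates the uncertainty by a factor of about 2.6. A hierarchical model addresses the same problem by sharing information across clusters instead of discarding the prompt-level data.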