Publishing a large language model (LLM) benchmark on the Internet, especially its ground-truth answers, risks contaminating future LLMs and enabling evaluation gaming: the benchmark may be unintentionally (or intentionally) used to train or select a model, or, when labels are accessible, exploited to overfit and hack leaderboards. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers, but this still permits test-set overfitting through feedback loops. To overcome this issue, we propose CapBencher, a way to publish benchmarks without fully disclosing the ground-truth answers while preserving open evaluation of LLMs. The main idea is to reduce the best possible accuracy, i.e., the Bayes accuracy, by injecting randomness into the answers: we prepare several logically correct answers for each question and include only one of them as the solution in the benchmark. Not only does this obscure the ground-truth answers, but it also offers a test for leakage or gaming: since even a fully capable model should not surpass the Bayes accuracy, any model that does so is a strong signal of leakage or gaming. We show theoretically and empirically that CapBencher accurately detects test-set overfitting across diverse benchmarks, models, training methodologies, and scenarios.
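To make the main idea concrete, the following is a minimal sketch under stated assumptions, not the procedure used in this work. It assumes integer-valued answers, a randomization rule of the form "add one of the given offsets, chosen uniformly at random" (so the Bayes accuracy is one over the number of offsets), and a one-sided exact binomial test as the detection rule. The names `cap_answers`, `bayes_accuracy`, `binomial_tail`, and `exceeds_cap` are hypothetical and introduced only for illustration.

```python
import math
import random


def cap_answers(true_answers, offsets=(0, 1), seed=0):
    """Publish, for each integer answer, the answer plus one random offset.

    Assumption: every offset yields a 'logically correct' answer to a question
    that asks the model to apply one of the offsets at random.
    """
    rng = random.Random(seed)
    return [a + rng.choice(offsets) for a in true_answers]


def bayes_accuracy(num_options):
    """Best achievable expected accuracy when the published answer is chosen
    uniformly among num_options equally valid answers."""
    return 1.0 / num_options


def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exceeds_cap(num_correct, num_questions, cap, alpha=0.01):
    """Flag a score that is significantly above the cap (one-sided test)."""
    return binomial_tail(num_correct, num_questions, cap) < alpha


if __name__ == "__main__":
    true_answers = list(range(200))            # toy ground truth
    published = cap_answers(true_answers)      # what the benchmark would release
    cap = bayes_accuracy(2)

    # A capable but uncontaminated model knows the true answer and must guess the offset.
    rng = random.Random(1)
    honest = [a + rng.choice((0, 1)) for a in true_answers]
    honest_correct = sum(h == p for h, p in zip(honest, published))

    # A contaminated model has memorized the published answers.
    leaked_correct = len(published)

    print("honest flagged:", exceeds_cap(honest_correct, len(published), cap))
    print("leaked flagged:", exceeds_cap(leaked_correct, len(published), cap))
```

In this toy setting a model that merely knows the true answers is expected to score close to the cap of 0.5, whereas a model that reproduces the published answers scores far above it and is flagged. The significance level `alpha` is an arbitrary choice made for this sketch.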