Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference -- a finite-population extension of active inference (Zrnic & Candès, 2024) that enables direct question selection while preserving coverage. With negligible overhead, FAQ delivers up to $5\times$ effective sample size gains over strong baselines on two benchmark suites, across varying levels of historical-data missingness: it matches the CI width of uniform sampling while using up to $5\times$ fewer queries. We release our source code and curated datasets to support reproducible evaluation and future research.
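To make the effective-sample-size (ESS) claim concrete: for a mean estimate, CI width scales as $1/\sqrt{n}$, so an ESS gain is the squared ratio of CI widths at the same budget. The sketch below illustrates this relationship with hypothetical widths (the function name and numbers are ours, not from the paper).

```python
import math

def effective_sample_size_gain(width_uniform, width_method):
    # For a mean estimate, CI width scales as 1/sqrt(n), so matching a
    # width that is r times narrower is equivalent to having r^2 times
    # as many samples: ESS gain = (width_uniform / width_method)^2.
    return (width_uniform / width_method) ** 2

# Hypothetical illustration: if a method's CI is sqrt(5)x narrower than
# uniform sampling at the same budget, its ESS gain is 5x -- equivalently,
# uniform sampling would need 5x more queries to match that CI width.
gain = effective_sample_size_gain(1.0, 1.0 / math.sqrt(5))
print(round(gain, 6))  # -> 5.0
```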