Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks whose outputs are scored continuously rather than marked correct or incorrect. We present a principled extension of Item Response Theory (IRT)-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items, and at as little cost, as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses only 2% of the items while improving ranking correlation by 0.12 τ over random sampling, and achieves 95% accuracy on its confident predictions.
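To make the distributional swap concrete, the sketch below illustrates one way the heteroskedastic normal response model could look. The abstract does not specify the exact parameterization, so this assumes a 2PL-style logistic mean and a variance proportional to μ(1−μ), which mimics the Bernoulli variance profile and shrinks near the score bounds; the function names, the scale `c`, and the example parameters are all hypothetical.

```python
import numpy as np

# Minimal sketch (not the paper's exact model): 2PL-style logistic mean with
# a heteroskedastic variance proportional to mu * (1 - mu), so noise shrinks
# near the score bounds 0 and 1. Parameter names and values are assumptions.

def expected_score(theta, a, b):
    """Mean score in (0, 1) for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def heteroskedastic_loglik(score, theta, a, b, c=0.05):
    """Normal log-likelihood replacing the Bernoulli response distribution."""
    mu = expected_score(theta, a, b)
    var = c * mu * (1.0 - mu) + 1e-8  # variance floor for numerical safety
    return -0.5 * (np.log(2.0 * np.pi * var) + (score - mu) ** 2 / var)

# Hypothetical usage: grid maximum-likelihood ability estimate from three
# (score, a, b) observations for a single model under evaluation.
responses = [(0.62, 1.3, -0.4), (0.35, 0.9, 0.8), (0.81, 1.1, -1.0)]
grid = np.linspace(-4.0, 4.0, 401)
loglik = sum(heteroskedastic_loglik(s, grid, a, b) for s, a, b in responses)
theta_hat = grid[np.argmax(loglik)]
```

Tying the variance to μ(1−μ) keeps the continuous model's uncertainty structure analogous to the Bernoulli case it replaces, which is one plausible reading of "heteroskedastic" here.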
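The abstract also does not detail the adaptive stopping criterion, so the following is only one illustrative shape it could take: stop once every adjacent pair of models in the provisional ranking has abilities separated beyond their combined standard errors. The function name, the pairwise z-test form, and the example values are assumptions, not the paper's rule.

```python
import numpy as np

# Hypothetical stopping check, assuming each model's ability has been fit to
# a posterior mean theta_mean[i] with standard error theta_se[i]. Stop once
# every adjacent pair in the provisional ranking is separated at level z.

def confident_ranking(theta_mean, theta_se, z=1.96):
    order = np.argsort(-np.asarray(theta_mean))  # provisional best-to-worst
    for i, j in zip(order[:-1], order[1:]):
        gap = theta_mean[i] - theta_mean[j]
        if gap < z * np.hypot(theta_se[i], theta_se[j]):
            return None  # an adjacent pair is still ambiguous: keep testing
    return order  # ranking resolved at the chosen confidence level

# Example loop shape: administer one informative item per round, refit the
# ability posteriors, and stop as soon as the ranking check passes.
theta_mean = np.array([0.9, 0.4, -0.2])
theta_se = np.array([0.10, 0.12, 0.15])
ranking = confident_ranking(theta_mean, theta_se)  # array([0, 1, 2]) here
```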