How can we precisely estimate a large language model's (LLM) accuracy on questions belonging to a specific topic within a larger question-answering dataset? The standard direct estimator, which averages the model's accuracy over the questions in each subgroup, may exhibit high variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model's accuracy on questions about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution: an empirical Bayes (EB) estimator that balances the direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more precise estimates of LLM performance than the direct and regression approaches, achieving substantial reductions in mean squared error. Confidence intervals for the EB estimates also have near-nominal coverage and are narrower than those for the direct estimator. Additional experiments on tabular and vision data validate the benefits of this EB approach.
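To make the idea concrete, the following is a minimal sketch of a generic empirical Bayes shrinkage estimator of the kind the abstract describes: each subgroup's direct accuracy estimate is pulled toward a regression prediction, with a weight determined by the subgroup's sampling variance. The function name, the method-of-moments estimate of the between-subgroup variance, and the binomial variance formula are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def eb_shrinkage(direct, n, reg_pred):
    """Generic EB shrinkage sketch (not the paper's exact method).

    direct   : per-subgroup accuracy (fraction correct)
    n        : per-subgroup sample size
    reg_pred : regression-based prediction of each subgroup's accuracy
    """
    direct = np.asarray(direct, dtype=float)
    n = np.asarray(n, dtype=float)
    reg_pred = np.asarray(reg_pred, dtype=float)

    # Sampling variance of each direct (binomial proportion) estimate:
    # small subgroups get large variance, hence more shrinkage.
    sigma2 = direct * (1.0 - direct) / np.maximum(n, 1.0)

    # Assumed method-of-moments estimate of the between-subgroup
    # residual variance tau^2 around the regression predictions.
    resid = direct - reg_pred
    tau2 = max(float(np.mean(resid**2 - sigma2)), 0.0)

    # Shrinkage weight: w -> 1 (trust the direct estimate) as n grows,
    # w -> 0 (trust the regression) for tiny subgroups.
    denom = tau2 + sigma2
    w = np.divide(tau2, denom, out=np.zeros_like(denom), where=denom > 0)
    return w * direct + (1.0 - w) * reg_pred
```

For example, with `direct = [0.5, 0.9]`, `n = [4, 400]`, and `reg_pred = [0.7, 0.7]`, the 4-sample subgroup is pulled strongly toward 0.7, while the 400-sample subgroup stays close to its direct estimate of 0.9.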