Many model evaluation tasks reduce to estimating an average loss, error rate, or subgroup metric on a stratified pool when each label, human rating, or simulator call is costly. The precision-optimal Neyman allocation depends on within-stratum variances, which must be learned from the same observations used for estimation. We formulate this as a sequential allocation problem and use the exact one-step marginal variance reduction as the priority index. Replacing the unknown variances by independent inverse-chi-squared posterior draws yields TS-Neyman, a Thompson-sampling rule that preserves the oracle marginal-gain structure while randomizing over variance uncertainty. For any fixed finite number of strata, we prove almost-sure convergence of the TS-Neyman allocation proportions to the Neyman target, asymptotic optimality of the variance proxy, and a central limit theorem for the resulting adaptive stratified estimator. In two five-stratum budget-scaling benchmarks, one bounded-loss benchmark and one binary model-error benchmark in the spirit of Dai et al. 2023, TS-Neyman's relative efficiency stays within 5 percent of the oracle on the bounded-loss population and within about 15 percent on the binary benchmark. In an additional CivilComments real-data replay with confidence-based strata, it stays within about 8 percent of the oracle and improves on equal allocation by roughly 7 to 14 percent in MSE across budgets, while plug-in greedy and two-stage plug-in can degrade by over an order of magnitude under sparse pilots. Common-pilot warm-start and prior-sensitivity studies show that this behavior is stable under working-model and working-prior misspecification.
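The allocation rule summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the variance proxy is sum_h W_h^2 sigma_h^2 / n_h, so that adding one observation to stratum h lowers it by exactly W_h^2 sigma_h^2 / (n_h (n_h + 1)). It also assumes a Gaussian working model whose scaled inverse-chi-squared posterior has hypothetical prior parameters `nu0` and `s0_sq`. The names `marginal_gain`, `sample_variances`, `ts_neyman`, `draw`, and `pilot` are introduced here for illustration only.

```python
import numpy as np

def marginal_gain(weights, variances, counts):
    """Exact drop in sum_h W_h^2 * sigma_h^2 / n_h when stratum h gets one more sample."""
    return weights**2 * variances / (counts * (counts + 1.0))

def sample_variances(rng, counts, sum_sq_dev, nu0=1.0, s0_sq=1.0):
    """One independent scaled inverse-chi-squared draw per stratum.

    Assumed working model: Gaussian observations with unknown mean, so the
    posterior degrees of freedom are nu0 + n_h - 1 (nu0, s0_sq are hypothetical).
    """
    nu = nu0 + counts - 1.0
    scale = nu0 * s0_sq + sum_sq_dev
    return scale / rng.chisquare(nu)

def ts_neyman(draw, weights, budget, pilot=2, seed=0):
    """Sequentially allocate `budget` observations across strata.

    draw(h, rng) returns one costly observation from stratum h.
    Returns the stratified mean estimate and the per-stratum sample counts.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    n_strata = len(weights)
    # Small pilot so every stratum has a defined sample variance.
    samples = [[draw(h, rng) for _ in range(pilot)] for h in range(n_strata)]
    for _ in range(budget - pilot * n_strata):
        counts = np.array([len(s) for s in samples], dtype=float)
        sum_sq_dev = np.array([np.sum((np.asarray(s) - np.mean(s)) ** 2) for s in samples])
        # Thompson step: plug sampled variances into the oracle index.
        sigma2 = sample_variances(rng, counts, sum_sq_dev)
        h = int(np.argmax(marginal_gain(weights, sigma2, counts)))
        samples[h].append(draw(h, rng))
    counts = np.array([len(s) for s in samples])
    means = np.array([np.mean(s) for s in samples])
    return float(weights @ means), counts
```

With the true variances substituted for the posterior draws, the same index gives the oracle greedy rule. The sketch changes only where the variances come from, which is the sense in which the marginal-gain structure is preserved.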