We study fixed-confidence best-arm identification (BAI) in which a cheap but potentially biased proxy (e.g., an LLM judge) is available for every sample, while an expensive ground-truth label can be acquired only selectively through human auditing. Unlike in classical multi-fidelity BAI, the proxy bias is arm- and context-dependent, and the ground truth is observed only selectively. Consequently, standard multi-fidelity methods can mis-select the best arm, while uniform auditing, though accurate, wastes scarce audit resources. We prove that, without bias correction and propensity adjustment, the mis-selection probability may fail to vanish even with unlimited proxy data. We then develop an estimator of each arm's mean that combines proxy scores with inverse-propensity-weighted residuals, and we construct anytime-valid confidence sequences for that estimator. Building on the estimator and its confidence sequences, we propose an algorithm that adaptively selects and audits arms. The algorithm concentrates audits on unreliable contexts and on arms whose means are close, and we prove that a plug-in Neyman rule achieves near-oracle audit efficiency. Numerical experiments confirm the theoretical guarantees and demonstrate the superior empirical performance of the proposed algorithm.
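To make the idea of combining proxy scores with inverse-propensity-weighted residuals concrete, the following is a minimal sketch of one standard form such an estimator can take: the average of `proxy + audited / propensity * (truth - proxy)`. The abstract does not give the exact estimator, so this form, the function names `ipw_corrected_mean` and `simulate_arm`, and the simulated bias and audit probabilities are all illustrative assumptions rather than the authors' method.

```python
import numpy as np


def ipw_corrected_mean(proxy, truth, audited, propensity):
    """Proxy mean plus inverse-propensity-weighted audit residuals.

    Hypothetical helper (not from the paper). `truth` may hold NaN
    wherever `audited` is False; those entries are never used.
    """
    proxy = np.asarray(proxy, dtype=float)
    truth = np.asarray(truth, dtype=float)
    audited = np.asarray(audited, dtype=bool)
    propensity = np.asarray(propensity, dtype=float)
    # The residual is only observed on audited samples; elsewhere it is set to 0.
    residual = np.where(audited, truth - proxy, 0.0)
    return float(np.mean(proxy + residual / propensity))


def simulate_arm(n=100_000, seed=0):
    """Simulate one arm with a context-dependent proxy bias (assumed setup)."""
    rng = np.random.default_rng(seed)
    context = rng.integers(0, 2, size=n)         # two contexts, 0 and 1
    truth = rng.uniform(0.0, 1.0, size=n)        # ground-truth scores
    proxy = truth + 0.2 * context                # proxy is biased only in context 1
    propensity = np.where(context == 1, 0.5, 0.1)  # known audit probabilities
    audited = rng.uniform(size=n) < propensity
    observed = np.where(audited, truth, np.nan)  # labels exist only when audited
    true_mean = float(truth.mean())
    proxy_only = float(proxy.mean())
    corrected = ipw_corrected_mean(proxy, observed, audited, propensity)
    return true_mean, proxy_only, corrected


if __name__ == "__main__":
    true_mean, proxy_only, corrected = simulate_arm()
    print(f"true={true_mean:.3f} proxy_only={proxy_only:.3f} corrected={corrected:.3f}")
```

In this simulated setup the proxy-only average stays biased no matter how many proxy scores are collected, while the corrected average is unbiased as long as the audit probabilities are known and strictly positive. Auditing more often in the context where the proxy is unreliable, as done here, also reduces the variance of the correction term.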