Active Statistical Inference is a new framework for making precise claims about population parameters with provable statistical guarantees. It uses a predictive "black-box" machine learning (ML) model to decide strategically which data points to label, roughly prioritizing samples whose labels the ML model is uncertain about. A major issue is that the framework can be brittle when the uncertainty estimates are noisy. This paper introduces OPAL (Optimized Policy for Allocation of Labels), which learns a labeling strategy within a tractable class of smooth policies that yields the lowest-variance estimator. In effect, OPAL is an end-to-end pipeline that turns a black-box model's uncertainty scores into a data-adaptive labeling strategy and then performs inference on the collected samples. We evaluate OPAL on real datasets spanning medical imaging, computational social science, and proteomics. As a concrete example, we consider predicting breast cancer subtype from histopathology images and using OPAL to form valid confidence intervals for odds ratios across demographic groups. We show that OPAL achieves nominal coverage in finite samples and attains the accuracy one would expect from methods that use far more labeled samples.
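To make the labeling idea concrete, the following is a minimal sketch of generic active inference for a population mean. It is not OPAL's learned policy. Each point is labeled with a probability that grows with the model's uncertainty score, and the collected labels correct the model's predictions through inverse-probability weighting. The function name `active_mean_estimate`, the uniform mixing weight `mix`, the clipping floor, and the synthetic data are all assumptions made for illustration; mixing with a uniform rule is only one simple way to keep sampling probabilities away from zero when uncertainty scores are noisy.

```python
import numpy as np

def active_mean_estimate(preds, uncertainty, label_fn, budget_frac, rng,
                         mix=0.1, z=1.96):
    """Illustrative active-inference estimate of a population mean.

    preds        model predictions f(X_i) for all n unlabeled points
    uncertainty  nonnegative uncertainty scores, one per point
    label_fn     callable returning true labels for an array of indices
    budget_frac  target fraction of points to label
    mix          hypothetical weight on uniform sampling (keeps pi > 0)
    """
    n = len(preds)
    # Sampling rule: proportional to uncertainty, mixed with a uniform rule.
    u = uncertainty / uncertainty.mean()
    pi = np.clip(budget_frac * ((1 - mix) * u + mix), 1e-3, 1.0)
    # Decide independently for each point whether to collect its label.
    xi = rng.random(n) < pi
    labels = np.zeros(n)
    labels[xi] = label_fn(np.flatnonzero(xi))
    # Prediction plus inverse-probability-weighted correction; unbiased
    # because xi has mean pi for every point.
    terms = preds + (labels - preds) * xi / pi
    est = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(n)
    return est, (est - z * se, est + z * se), int(xi.sum())

# Synthetic demonstration: predictions are noisier where uncertainty is higher.
rng = np.random.default_rng(0)
n = 20000
y = rng.normal(1.0, 1.0, n)                 # true labels, hidden until queried
noise_scale = rng.uniform(0.1, 1.0, n)      # stands in for an uncertainty score
preds = y + rng.normal(0.0, noise_scale)    # imperfect black-box predictions
est, ci, n_labeled = active_mean_estimate(
    preds, noise_scale, lambda idx: y[idx], budget_frac=0.1, rng=rng
)
print(f"estimate={est:.3f}, CI=({ci[0]:.3f}, {ci[1]:.3f}), labels used={n_labeled}")
```

The interval here relies on a normal approximation with a plug-in standard error. A method such as OPAL would replace the fixed sampling rule with one chosen to minimize estimator variance.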