Inspired by the concept of active learning, we propose active inference, a methodology for statistical inference with machine-learning-assisted data collection. Given a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus making effective use of the budget. It operates on a simple yet powerful intuition: prioritize collecting labels for data points where the model is uncertain, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines that rely on non-adaptively collected data. Equivalently, for the same number of collected samples, active inference yields smaller confidence intervals and more powerful p-values. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics.
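To make the intuition concrete, the following is a minimal sketch for the simplest case of estimating a mean label. The abstract does not specify the estimator, so everything here is an assumption for illustration: the helper name `active_mean_ci`, the rule that sets each labeling probability proportional to a model uncertainty score, the probability floor `min_prob`, and the inverse-probability-weighted correction with a normal-approximation interval. It is not presented as the authors' exact procedure.

```python
import math
import random
from statistics import NormalDist


def active_mean_ci(preds, uncertainties, query_label, budget,
                   alpha=0.1, min_prob=0.01, seed=0):
    """Hypothetical helper: confidence interval for the mean label.

    preds[i]         -- model prediction for point i
    uncertainties[i] -- nonnegative model uncertainty score for point i
    query_label(i)   -- returns the true label of point i (costs one label)
    budget           -- target number of labels (met only approximately)
    """
    rng = random.Random(seed)
    n = len(preds)
    total_u = sum(uncertainties)

    # Labeling probabilities: proportional to uncertainty and scaled toward
    # the budget. Clipping to [min_prob, 1] keeps the weights bounded, so the
    # expected number of labels only roughly matches the budget.
    if total_u > 0:
        raw = [budget * u / total_u for u in uncertainties]
    else:
        raw = [budget / n] * n
    probs = [min(1.0, max(min_prob, p)) for p in raw]

    terms = []
    n_labeled = 0
    for i in range(n):
        term = preds[i]  # rely on the prediction by default
        if rng.random() < probs[i]:
            # Label collected: add an inverse-probability-weighted correction,
            # so each term has expectation equal to the true label.
            y = query_label(i)
            term += (y - preds[i]) / probs[i]
            n_labeled += 1
        terms.append(term)

    estimate = sum(terms) / n
    variance = sum((t - estimate) ** 2 for t in terms) / (n - 1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(variance / n)
    return estimate - half_width, estimate + half_width, n_labeled


if __name__ == "__main__":
    # Synthetic data (made up for illustration): the model is accurate where
    # its uncertainty score is small and noisy where it is large.
    data_rng = random.Random(1)
    n = 2000
    unc = [data_rng.uniform(0.05, 1.0) for _ in range(n)]
    preds = [data_rng.uniform(0.0, 1.0) for _ in range(n)]
    labels = [p + data_rng.gauss(0.0, u) for p, u in zip(preds, unc)]

    lo, hi, used = active_mean_ci(preds, unc, lambda i: labels[i], budget=200)
    print(f"interval: [{lo:.3f}, {hi:.3f}] using {used} labels")
```

In this sketch, points with low uncertainty are rarely labeled and contribute their prediction alone, while labeled points contribute a reweighted correction that keeps the estimate unbiased regardless of how good the model is. The interval narrows when the model is accurate on the points it is confident about, which matches the intuition described above.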