Conformal prediction is a non-parametric technique for constructing prediction intervals or sets from arbitrary predictive models under the assumption that the data are exchangeable. It is popular because it comes with theoretical guarantees on the marginal coverage of the prediction sets, and because the split conformal prediction variant has a very low computational cost compared to model training. We study the robustness of split conformal prediction in a data contamination setting, where we assume a small fraction of the calibration scores are drawn from a different distribution than the bulk. We quantify the impact of the corrupted data on the coverage and efficiency of the constructed sets when evaluated on "clean" test points, and verify our results with numerical experiments. Moreover, we propose an adjustment in the classification setting, which we call Contamination Robust Conformal Prediction, and verify the efficacy of our approach using both synthetic and real datasets.
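To make the split conformal procedure referenced above concrete, here is a minimal sketch for the regression case, assuming absolute residuals as the conformity score and a held-out calibration set; the function name, toy data, and constant predictor are illustrative, not from the paper.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction intervals.

    residuals_cal: conformity scores |y - f(x)| on a held-out calibration set
    y_pred_test:   point predictions f(x) on test inputs
    alpha:         target miscoverage; marginal coverage is >= 1 - alpha
    """
    n = len(residuals_cal)
    # Conformal quantile level: ceil((n + 1) * (1 - alpha)) / n, capped at 1
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(residuals_cal, level, method="higher")
    return y_pred_test - qhat, y_pred_test + qhat

# Toy usage: constant predictor f(x) = 0 on standard-normal targets
rng = np.random.default_rng(0)
y_cal = rng.normal(0.0, 1.0, size=1000)
residuals = np.abs(y_cal - 0.0)
lo, hi = split_conformal_interval(residuals, np.zeros(5), alpha=0.1)
```

Under contamination, a fraction of `residuals_cal` would come from a different distribution, shifting the empirical quantile `qhat` and hence the coverage and width of the intervals on clean test points; quantifying that shift is the subject of the paper.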