Conformalized Super Learner

The Super Learner (SL) is a widely used ensemble method that combines predictions from a library of learners based on their predictive performance. Interval predictions are of considerable practical interest because they quantify the uncertainty in predictions produced by an individual learner or an ensemble. Several methods have been proposed for constructing interval predictions based on the SL; however, these approaches are typically justified by asymptotic arguments or rely on computationally intensive procedures such as the bootstrap. Conformal prediction (CP) is a machine learning framework for constructing prediction intervals with finite-sample and asymptotic coverage guarantees under mild conditions. We propose coupling CP with the SL through a natural construction that mirrors the original SL framework, using individual learner weights and combining learner-specific conformity scores via a weighted majority vote. We characterize the properties of the resulting SL-based prediction intervals for continuous outcomes. We consider settings with exchangeable data, potential violations of exchangeability, and data-generating mechanisms exhibiting heteroscedasticity, sparsity, and other forms of distributional heterogeneity. A comprehensive simulation study shows that the conformalized SL achieves valid finite-sample coverage with competitive performance relative to the true data-generating mechanism. A central contribution of this work is an application to predicting creatinine levels using socio-demographic, biometric, and laboratory measurements. This example demonstrates the benefits of an ensemble with carefully selected learners designed to capture key aspects of complex regression functions, including non-linear effects, interactions, sparsity, heteroscedasticity, and robustness to outliers.
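To make the weighted majority vote concrete, the following is a minimal sketch and not the construction analyzed in the paper. It assumes split conformal prediction with absolute-residual conformity scores, a small library of polynomial learners, and fixed weights that stand in for SL weights. The function names `split_conformal_interval` and `weighted_majority_vote`, the weight values, and the candidate grid are all hypothetical choices made for illustration.

```python
import numpy as np

def split_conformal_interval(predict, x_cal, y_cal, x_new, alpha=0.1):
    """Split-conformal interval for one fitted learner (absolute-residual scores)."""
    scores = np.abs(y_cal - predict(x_cal))
    n = scores.size
    # Rank of the finite-sample-corrected empirical quantile of the scores.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(scores)[k - 1] if k <= n else np.inf
    center = float(predict(np.array([x_new]))[0])
    return center - q, center + q

def weighted_majority_vote(intervals, weights, grid):
    """Keep candidate values covered by learners holding more than half the weight."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    votes = np.zeros(grid.shape)
    for (lo, hi), w in zip(intervals, weights):
        votes += w * ((grid >= lo) & (grid <= hi))
    return grid[votes > 0.5]

def make_poly_learner(x_train, y_train, degree):
    """Hypothetical learner: least-squares polynomial fit of a given degree."""
    coef = np.polyfit(x_train, y_train, degree)
    return lambda x: np.polyval(coef, x)

# Simulated data, split into a training part and a calibration part.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=200)
y = 2.0 * x + rng.normal(0.0, 1.0, size=200)
x_tr, y_tr, x_cal, y_cal = x[:100], y[:100], x[100:], y[100:]

learners = [make_poly_learner(x_tr, y_tr, d) for d in (0, 1, 3)]
weights = [0.1, 0.6, 0.3]  # assumed values standing in for SL weights

x_new = 2.5
intervals = [split_conformal_interval(f, x_cal, y_cal, x_new) for f in learners]
grid = np.linspace(y.min() - 5.0, y.max() + 5.0, 2001)
region = weighted_majority_vote(intervals, weights, grid)
if region.size:
    print("approximate prediction set range:", region.min(), region.max())
```

The vote is evaluated on a grid because the set of values that win a weighted majority need not be a single interval. Reporting only its minimum and maximum, as done here, is a simplification.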
