Practical considerations for specifying a super learner

From arXiv. A revised version of this article, which incorporates several modifications based on referees' suggestions, has been published in the International Journal of Epidemiology by Oxford University Press.

Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modeling. Constructing a predictive model can be thought of as learning a prediction function, i.e., a function that takes as input covariate data and outputs a predicted value. Many strategies for learning these functions from data are available, from parametric regressions to machine learning algorithms. It can be challenging to choose an approach, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task at hand. The super learner (SL) is an algorithm that alleviates concerns over selecting the one "right" strategy while providing the freedom to consider many of them, such as those recommended by collaborators, used in related research, or specified by subject-matter experts. It is an entirely pre-specified and data-adaptive strategy for predictive modeling. To ensure the SL is well-specified for learning the prediction function, the analyst does need to make a few important choices. In this Education Corner article, we provide step-by-step guidelines for making these choices, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience, and guided by theory.
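To make the idea of combining several candidate strategies concrete, here is a minimal sketch in Python. It is not the authors' implementation or software: it assumes scikit-learn's `StackingRegressor` as a stand-in for a super learner, uses synthetic data, and picks an arbitrary, hypothetical library of three candidate learners. The meta-learner is assumed to be a non-negative least squares regression fitted to cross-validated predictions, which is one common choice rather than a recommendation taken from the article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold

# Synthetic data standing in for real covariates X and outcome y (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Library of candidate learners; these three are hypothetical choices
# made only for illustration.
library = [
    ("ols", LinearRegression()),
    ("ridge", Ridge(alpha=1.0)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# Meta-learner: non-negative least squares without an intercept, so the
# ensemble is a non-negatively weighted sum of the candidates' predictions.
meta_learner = LinearRegression(positive=True, fit_intercept=False)

# The meta-learner is fitted on 5-fold cross-validated predictions of each
# candidate; the candidates are then refitted on the full data.
super_learner = StackingRegressor(
    estimators=library,
    final_estimator=meta_learner,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
super_learner.fit(X, y)

# One weight per candidate learner, in the order given in `library`.
weights = super_learner.final_estimator_.coef_
predictions = super_learner.predict(X[:5])

for (name, _), w in zip(library, weights):
    print(f"{name}: weight = {w:.3f}")
```

The choices an analyst would still need to make in this sketch are the contents of `library`, the meta-learner, and the cross-validation scheme passed as `cv`. The values used here are placeholders, not suggested defaults.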
