Practical considerations for specifying a super learner

From arXiv. A revised version of this article has been accepted for publication in the International Journal of Epidemiology, published by Oxford University Press.

Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modeling. Constructing a predictive model can be thought of as learning a prediction function, i.e., a function that takes covariate data as input and outputs a predicted value. Many strategies for learning these functions from data are available, from parametric regressions to machine learning algorithms. It can be challenging to choose an approach, as it is impossible to know in advance which one is the most suitable for the particular dataset and prediction task at hand. The super learner (SL) is an algorithm that alleviates concerns over selecting the one "right" strategy while providing the freedom to consider many of them, such as those recommended by collaborators, used in related research, or specified by subject-matter experts. It is an entirely pre-specified and data-adaptive strategy for predictive modeling. To ensure the SL is well-specified for learning the prediction function, the analyst does need to make a few important choices. In this Education Corner article, we provide step-by-step guidelines for making these choices, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by theory.
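To make the idea of choosing among candidate strategies in a pre-specified, data-adaptive way concrete, the following is a minimal sketch of a "discrete" super learner: it estimates each candidate's cross-validated risk and selects the candidate with the lowest one. It is an illustration under stated assumptions, not the specification recommended in the article. The two-learner library, the simulated data, and all function names (`fit_mean`, `fit_linear`, `cv_risk`, `discrete_super_learner`) are hypothetical, and the sketch selects a single candidate rather than building the weighted combination of candidates that an ensemble SL would typically use.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1 (hypothetical): ignore the covariate, predict the mean outcome."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate 2 (hypothetical): simple least-squares line in one covariate."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = ybar - slope * xbar
    return lambda x: intercept + slope * x


def cv_risk(fit, xs, ys, n_folds=5):
    """Cross-validated mean squared error of one candidate learner."""
    n = len(xs)
    sq_errors = []
    for k in range(n_folds):
        # Fold k holds out every n_folds-th observation, starting at index k.
        valid = set(range(k, n, n_folds))
        train_x = [xs[i] for i in range(n) if i not in valid]
        train_y = [ys[i] for i in range(n) if i not in valid]
        predict = fit(train_x, train_y)
        sq_errors.extend((ys[i] - predict(xs[i])) ** 2 for i in valid)
    return mean(sq_errors)


def discrete_super_learner(library, xs, ys, n_folds=5):
    """Select the candidate with the lowest CV risk, then refit it on all data."""
    risks = {name: cv_risk(fit, xs, ys, n_folds) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks, library[best](xs, ys)


# Simulated data (an assumption for illustration): a noisy linear relationship.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear}
best, risks, predict = discrete_super_learner(library, xs, ys)
print("selected learner:", best)
print("cross-validated risks:", risks)
```

The library and the number of folds are fixed before the data are examined, which is what keeps the procedure pre-specified; only the selection among candidates adapts to the data.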
