Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here, we argue that a causal perspective improves the identification of these challenges and formalizes their relation to the robustness and generalization of machine learning-based diagnostics. To make the discussion concrete, we focus on a specific, recently established high-dimensional biomarker: adaptive immune receptor repertoires (AIRRs). Through simulations, we illustrate how major biological and experimental factors of the AIRR domain may influence the learned biomarkers. In conclusion, we argue that causal modeling improves the robustness of machine learning-based biomarkers by identifying stable relations between variables and by guiding the adjustment of the relations and variables that vary between populations.