Machine learning and deep learning have achieved many successes when applied to biological problems, especially in the domain of protein folding. Another equally complex and important question has received relatively little attention from the machine learning community: the prediction of complex traits from genetic data. Tackling this problem requires in-depth knowledge of the related genetics literature and awareness of the various subtleties associated with genetic data. In this guide, we provide the machine learning community with an overview of current state-of-the-art models and of the subtleties that need to be taken into consideration when developing new models for phenotype prediction. We use height as an example of a continuous-valued phenotype and provide an introduction to benchmark datasets, confounders, feature selection, and common metrics.