The preservation of soil health is a critical challenge in the 21st century because of its significant impact on agriculture, human health, and biodiversity. We provide the first in-depth investigation of the potential of machine learning models to predict, and thereby help explain, the connections between soil and biological phenotypes. We examine an integrative framework that accurately predicts plant phenotypes from the biological, chemical, and physical properties of the soil using two models: a random forest and a Bayesian neural network. We show that prediction improves when environmental features, such as soil physicochemical properties and microbial population density, are incorporated into the models in addition to the microbiome information. Exploring various data preprocessing strategies confirms that human decisions have a significant impact on predictive performance. We show that the naive total sum scaling normalization commonly used in microbiome research is not the optimal strategy for maximizing predictive power. We also find that accurately defined labels matter more than normalization, taxonomic level, or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model predictive power. Our work is accompanied by open source, reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.
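For readers unfamiliar with the total sum scaling (TSS) normalization mentioned above, the following is a minimal sketch of the transformation: each sample's taxon counts are divided by that sample's total, giving relative abundances. It is not taken from the accompanying repository. The function name `tss_normalize`, the count table, and the handling of empty samples are assumptions made for illustration only.

```python
def tss_normalize(counts):
    """Total sum scaling: divide each sample's taxon counts by the sample total.

    `counts` is a list of samples, each a list of non-negative taxon counts.
    Returns relative abundances; each non-empty sample sums to 1.
    """
    normalized = []
    for sample in counts:
        total = sum(sample)
        if total == 0:
            # Assumption: an empty sample has no defined composition,
            # so it is returned as all zeros instead of raising an error.
            normalized.append([0.0 for _ in sample])
        else:
            normalized.append([count / total for count in sample])
    return normalized


# Hypothetical count table: rows are soil samples, columns are taxa.
example_counts = [
    [10, 30, 60],
    [5, 0, 15],
    [0, 0, 0],
]
example_tss = tss_normalize(example_counts)
```

Because TSS only rescales each sample to proportions, it discards sequencing depth and leaves the data compositional, which is one plausible reason it need not be the best choice for prediction.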