Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data

The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework that performs accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction improves when environmental features such as soil physicochemical properties and microbial population density are incorporated into the models in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization commonly used in microbiome research is not the optimal strategy for maximizing predictive power. We also find that accurately defined labels are more important than normalization, taxonomic level, or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model prediction power. Our work is accompanied by open source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.
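For readers unfamiliar with the total sum scaling normalization mentioned above, the following is a minimal sketch of the general technique: each sample's taxon counts are divided by that sample's total count, giving relative abundances. The function name `total_sum_scaling`, the toy count table, and the handling of empty samples are illustrative assumptions and are not taken from the authors' scripts.

```python
def total_sum_scaling(counts):
    """Scale each sample (row) of taxon counts so that it sums to 1.

    `counts` is a list of rows, one row per sample and one column per taxon.
    Samples with a total count of zero are returned as all zeros
    (an assumption made here to avoid division by zero).
    """
    scaled = []
    for row in counts:
        total = sum(row)
        if total == 0:
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([value / total for value in row])
    return scaled


# Hypothetical toy data: 3 samples (rows) x 4 taxa (columns).
toy_counts = [
    [10, 0, 30, 60],
    [5, 5, 0, 0],
    [0, 0, 0, 0],
]

relative = total_sum_scaling(toy_counts)
for row in relative:
    print(row)
```

Because every non-empty sample is forced to sum to 1, this scaling discards differences in total count between samples; the sketch only illustrates what the normalization does and does not reproduce the comparison of preprocessing strategies reported in the abstract.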

