Machine learning (ML) is set to accelerate innovations in clinical microbiomics, such as in disease diagnostics and prognostics. This will require high-quality, reproducible, interpretable workflows whose predictive capabilities meet or exceed the high thresholds set for clinical tools by regulatory agencies. Here, we capture a snapshot of current practices in the application of supervised ML to microbiomics data, through an in-depth analysis of 100 peer-reviewed journal articles published in 2021-2022. We apply a data-driven approach to steer discussion of the merits of varied approaches to experimental design, including key considerations such as how to mitigate the effects of small dataset size while avoiding data leakage. We further provide guidance on how to avoid common experimental design pitfalls that can hurt model performance, trustworthiness, and reproducibility. Discussion is accompanied by an interactive online tutorial that demonstrates foundational principles of ML experimental design, tailored to the microbiomics community. Formalizing community best practices for supervised ML in microbiomics is an important step towards improving the success and efficiency of clinical research, to the benefit of patients and other stakeholders.