Machine learning techniques are now routinely used in research laboratories across the globe. ML and AI techniques have driven impressive progress in the processing of large data sets, increasing the experimenter's ability to digest data and make novel predictions about phenomena of interest. However, machine learning predictors generated from data sets in the natural sciences are often treated as black boxes and applied broadly, without detailed consideration of the causal structure of the data set of interest. Attempts have been made to bring causality into discussions of machine learning models of natural phenomena; however, a firm and unified theoretical treatment is lacking. This series of three papers explores the union of chemical theory, biological theory, probability theory and causality in order to correct current causal flaws of machine learning in the natural sciences. This paper, Part 1 of the series, provides the formal framework for the foundational causal structure of phenomena in chemical biology and extends it to machine learning through the novel concept of focus, defined here as the ability of a machine learning algorithm to narrow down to a hidden underpinning mechanism in large data sets. Initial proof of these principles on a family of Akt inhibitors is also provided. Part 2 will provide a formal exploration of chemical similarity, and Part 3 will present extensive experimental evidence of how hidden causal structures weaken all machine learning in chemical biology. This series serves to establish for chemical biology a new kind of mathematical framework for modeling mechanisms in Nature without the need for the tools of reductionism: inferential mechanics.