Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology, in which only the data exist, and a strong epistemology, in which a model is considered good if and only if it performs well on held-out data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here, we identify contexts in the natural sciences in which the ML ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML creates strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they introduce strong confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.