Data-driven astrophysics currently relies on detecting and characterising correlations between objects' properties, which are then used to test physical theories that make predictions for them. This process fails to utilise information in the data that forms a crucial part of the theories' predictions, namely which variables are directly correlated (as opposed to accidentally correlated through others), the directions of these determinations, and the presence or absence of confounders that correlate variables in the dataset but are themselves absent from it. We propose to recover this information through causal discovery, a well-developed methodology for inferring the causal structure of datasets that is, however, almost entirely unknown to astrophysics. We develop a causal discovery algorithm suitable for large astrophysical datasets and illustrate it on $\sim4.5\times10^5$ nearby galaxies from the NASA-Sloan Atlas, demonstrating its ability to distinguish physical mechanisms that are degenerate on the basis of correlations alone.