Additive Noise Models (ANM) encode a popular functional assumption that enables learning causal structure from observational data. Due to a lack of real-world data meeting the assumptions, synthetic ANM data are often used to evaluate causal discovery algorithms. Reisach et al. (2021) show that, for common simulation parameters, a variable ordering by increasing variance is closely aligned with a causal order and introduce var-sortability to quantify the alignment. Here, we show that not only variance, but also the fraction of a variable's variance explained by all others, as captured by the coefficient of determination $R^2$, tends to increase along the causal order. Simple baseline algorithms can use $R^2$-sortability to match the performance of established methods. Since $R^2$-sortability is invariant under data rescaling, these algorithms perform equally well on standardized or rescaled data, addressing a key limitation of algorithms exploiting var-sortability. We characterize and empirically assess $R^2$-sortability for different simulation parameters. We show that all simulation parameters can affect $R^2$-sortability and must be chosen deliberately to control the difficulty of the causal discovery task and the real-world plausibility of the simulated data. We provide an implementation of the sortability measures and sortability-based algorithms in our library CausalDisco (https://github.com/CausalDisco/CausalDisco).
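To make the two sortability notions concrete, the following is a minimal sketch and not the CausalDisco API. All function names are hypothetical. It assumes a linear ANM with unit-variance Gaussian noise on a three-node chain, the convention that `adj[i, j] != 0` denotes an edge from `i` to `j`, and a simplified sortability score that counts each ancestor-descendant pair once, with ties counted as one half; the definition used in the paper may weight pairs differently. The $R^2$ of each variable given all others is read off the inverse correlation matrix, which is why it does not change under rescaling.

```python
import numpy as np


def r2_from_all_others(X):
    """R^2 of each column when regressed linearly on all other columns.

    Uses the identity R^2_i = 1 - 1 / (C^{-1})_{ii}, where C is the
    correlation matrix (assumed invertible).
    """
    corr = np.corrcoef(X, rowvar=False)
    precision = np.linalg.inv(corr)
    return 1.0 - 1.0 / np.diag(precision)


def ancestor_matrix(adj):
    """Boolean matrix whose (i, j) entry is True if a directed path i -> j exists."""
    reach = adj != 0
    for k in range(adj.shape[0]):
        # Floyd-Warshall style update: i reaches j if i reaches k and k reaches j.
        reach = reach | (reach[:, [k]] & reach[[k], :])
    return reach


def sortability(scores, adj, tol=1e-9):
    """Fraction of ancestor-descendant pairs whose score increases (ties count 1/2).

    Simplifying assumption: every connected pair is counted exactly once.
    """
    causes, effects = np.nonzero(ancestor_matrix(adj))
    if causes.size == 0:
        return float("nan")
    diff = scores[effects] - scores[causes]
    return float(np.mean((diff > tol) + 0.5 * (np.abs(diff) <= tol)))


# Illustrative data: chain X0 -> X1 -> X2 with unit edge weights (an assumption).
rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0]])
noise = rng.standard_normal((5000, 3))
# Row-wise SEM X = X @ adj + noise, solved for X.
X = noise @ np.linalg.inv(np.eye(3) - adj)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

var_sort_raw = sortability(X.var(axis=0), adj)
var_sort_std = sortability(X_std.var(axis=0), adj)
r2_sort_raw = sortability(r2_from_all_others(X), adj)
r2_sort_std = sortability(r2_from_all_others(X_std), adj)

print("var-sortability (raw, standardized):", var_sort_raw, var_sort_std)
print("R2-sortability  (raw, standardized):", r2_sort_raw, r2_sort_std)
```

On raw data the variance-based score reflects the accumulating variance along the chain, whereas after standardization all variances are tied and the score carries no ordering information. The $R^2$-based score is computed from correlations only, so it is the same in both cases.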