When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods perform surprisingly well in terms of the false positive and false negative rates of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices.