New proposals for causal discovery algorithms are typically evaluated using simulations and a few select real data examples with known data generating mechanisms. However, no general guideline exists for how such evaluation studies should be designed, which makes it difficult to compare results across studies. In this article, we propose a common evaluation baseline by posing the question: Are we doing better than random guessing? For the task of graph skeleton estimation, we derive exact distributional results under random guessing for the expected behavior of a range of typical causal discovery evaluation metrics (including precision and recall). We show that these metrics can attain very large values under random guessing in certain scenarios, and hence we warn against using them without also reporting negative control results, i.e., performance under random guessing. We also propose an exact test of overall skeleton fit and showcase its use on a real data application. Finally, we propose a general pipeline for using random controls beyond the skeleton estimation task, and apply it both in a simulated example and a real data application.
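The phenomenon the abstract warns about can be illustrated with a minimal simulation (a sketch, not the paper's exact derivation): when the true skeleton is dense, even an estimator that includes each edge by an independent coin flip achieves high precision, because most candidate edges are in fact present. The edge densities (0.8 for the true skeleton, 0.5 for the guess) and graph size below are illustrative choices, not values from the article.

```python
import itertools
import random

def random_skeleton(nodes, p, rng):
    """Include each possible undirected edge independently with probability p."""
    return {e for e in itertools.combinations(nodes, 2) if rng.random() < p}

def precision_recall(true_edges, est_edges):
    """Standard skeleton-estimation metrics over sets of undirected edges."""
    tp = len(true_edges & est_edges)
    prec = tp / len(est_edges) if est_edges else 0.0
    rec = tp / len(true_edges) if true_edges else 0.0
    return prec, rec

rng = random.Random(7)
nodes = range(20)

# A dense "true" skeleton: each of the 190 possible edges present w.p. 0.8.
true_g = random_skeleton(nodes, 0.8, rng)

# Random guessing: include each edge with probability 0.5, many replications.
reps = 2000
avg_prec = avg_rec = 0.0
for _ in range(reps):
    guess = random_skeleton(nodes, 0.5, rng)
    p_, r_ = precision_recall(true_g, guess)
    avg_prec += p_
    avg_rec += r_
avg_prec /= reps
avg_rec /= reps

# Precision concentrates near the true edge density (~0.8) and recall near
# the guessing probability (~0.5), even though the guess ignores the data.
print(f"precision under random guessing: {avg_prec:.3f}")
print(f"recall under random guessing:    {avg_rec:.3f}")
```

This is exactly why a reported precision of 0.8 is uninformative on its own: the same number arises from pure guessing whenever the underlying graph is dense, which motivates the negative-control reporting the article advocates.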