Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real data sources, the true causal relations remain unknown. This issue is compounded by privacy concerns surrounding the release of suitable high-quality data. To help address these challenges, we gather a complex dataset comprising measurements from an assembly line in a manufacturing context. This line consists of numerous physical processes for which we are able to provide ground truth causal relationships on the basis of a detailed study of the underlying physics. We use the assembly line data and the associated ground truth information to build a system for generating semisynthetic manufacturing data that supports benchmarking of causal discovery methods. To accomplish this, we employ distributional random forests to flexibly estimate and represent conditional distributions, which can then be combined into joint distributions that strictly adhere to a causal model over the observed variables. The estimated conditionals and tools for data generation are made available in our Python library $\texttt{causalAssembly}$. Using the library, we showcase how to benchmark several well-known causal discovery algorithms.
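The idea of combining per-variable conditional distributions into a joint distribution that respects a causal graph can be illustrated with ancestral sampling: visit the variables in a topological order of the graph and draw each one given the already sampled values of its parents. The sketch below is only an illustration under stated assumptions. The variable names, the toy graph, and the linear-Gaussian conditionals are hypothetical stand-ins for the conditionals estimated with distributional random forests, and nothing here uses or reflects the actual API of $\texttt{causalAssembly}$.

```python
from graphlib import TopologicalSorter

import numpy as np

# Hypothetical toy causal graph: each node maps to the list of its parents.
parents = {
    "temperature": [],
    "pressure": ["temperature"],
    "torque": ["temperature", "pressure"],
}

# Hypothetical linear-Gaussian conditionals (assumption): each node equals a
# weighted sum of its parents plus standard normal noise. In the setting of
# the abstract these would be replaced by nonparametric estimates.
weights = {
    "temperature": {},
    "pressure": {"temperature": 0.8},
    "torque": {"temperature": -0.5, "pressure": 1.2},
}


def sample_joint(parents, weights, n_samples, seed=0):
    """Draw samples from the joint distribution by ancestral sampling."""
    rng = np.random.default_rng(seed)
    # TopologicalSorter expects node -> predecessors, so parents come first.
    order = list(TopologicalSorter(parents).static_order())
    data = {}
    for node in order:
        # All parents of `node` have already been sampled at this point.
        mean = sum(weights[node][p] * data[p] for p in parents[node])
        data[node] = mean + rng.normal(size=n_samples)
    return order, data


order, data = sample_joint(parents, weights, n_samples=1000)
```

Because every variable is generated only from its parents, the sampled data factorizes according to the graph by construction, which is what makes the known graph usable as ground truth when scoring a causal discovery algorithm.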