Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real data sources, the true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To help address these challenges, we gather a complex dataset comprising measurements from an assembly line in a manufacturing context. This line consists of numerous physical processes for which we are able to provide ground truth causal relationships on the basis of a detailed study of the underlying physics. We use the assembly line data and associated ground truth information to build a system for generating semisynthetic manufacturing data that supports benchmarking of causal discovery methods. To accomplish this, we employ distributional random forests to flexibly estimate and represent conditional distributions, which may then be combined into joint distributions that strictly adhere to a causal model over the observed variables. The estimated conditionals and tools for data generation are made available in our Python library $\texttt{causalAssembly}$. Using the library, we showcase how to benchmark several well-known causal discovery algorithms.
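The idea of combining estimated conditionals into a joint distribution that adheres to a causal model can be illustrated with a minimal sketch of ancestral sampling over a directed acyclic graph. This sketch does not use the $\texttt{causalAssembly}$ API and does not implement distributional random forests. As a simple stand-in, each conditional is approximated by resampling the child value from the k training rows whose parent values are nearest, which loosely mimics representing a conditional distribution as a reweighted empirical distribution of training responses. The variable names (`force`, `gap`, `torque`), the toy graph, and all function names are hypothetical and chosen only for illustration.

```python
import random
from graphlib import TopologicalSorter


def fit_conditionals(data, parents):
    """For each node, store (parent values, node value) pairs from the training rows."""
    return {
        node: [(tuple(row[p] for p in pa), row[node]) for row in data]
        for node, pa in parents.items()
    }


def sample_node(table, parent_values, k, rng):
    """Draw one value from the k training rows whose parent values are closest."""
    if not parent_values:
        # Source node: resample from the empirical marginal.
        return rng.choice(table)[1]
    nearest = sorted(
        table,
        key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], parent_values)),
    )[:k]
    return rng.choice(nearest)[1]


def sample_joint(conditionals, parents, n, k=5, seed=0):
    """Ancestral sampling: visit nodes in topological order, parents first."""
    rng = random.Random(seed)
    order = list(TopologicalSorter(parents).static_order())
    rows = []
    for _ in range(n):
        row = {}
        for node in order:
            pa_vals = tuple(row[p] for p in parents[node])
            row[node] = sample_node(conditionals[node], pa_vals, k, rng)
        rows.append(row)
    return rows


# Hypothetical ground-truth graph: each node maps to its list of parents.
parents = {"force": [], "gap": ["force"], "torque": ["force", "gap"]}

# Toy "real" data, generated here only so that the sketch is self-contained.
data_rng = random.Random(1)
train = []
for _ in range(200):
    f = data_rng.gauss(0, 1)
    g = 0.8 * f + data_rng.gauss(0, 0.3)
    t = f - 0.5 * g + data_rng.gauss(0, 0.3)
    train.append({"force": f, "gap": g, "torque": t})

conditionals = fit_conditionals(train, parents)
synthetic = sample_joint(conditionals, parents, n=100)
```

Because every variable is drawn only from a conditional given its parents in the graph, the sampled joint distribution factorizes according to the assumed causal model by construction, so the graph can serve as ground truth when scoring a causal discovery algorithm on the synthetic rows.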