One of the fundamental challenges found throughout the data sciences is to explain why things happen in specific ways, or through which mechanisms a certain variable $X$ exerts influence on another variable $Y$. In statistics and machine learning, significant effort has been put into developing machinery to estimate correlations across variables efficiently. In causal inference, a large body of literature is concerned with the decomposition of causal effects under the rubric of mediation analysis. However, many of the variations observed in phenomena throughout the applied sciences are spurious in nature. Despite the statistical power to estimate correlations and the identification power to decompose causal effects, there is still little understanding of the properties of spurious associations and how they can be decomposed in terms of the underlying causal mechanisms. In this manuscript, we develop formal tools for decomposing spurious variations in both Markovian and Semi-Markovian models. We prove the first results that allow a non-parametric decomposition of spurious effects and provide sufficient conditions for the identification of such decompositions. The described approach has several applications, ranging from explainable and fair AI to questions in epidemiology and medicine, and we empirically demonstrate its use on a real-world dataset.