Distinguishing cause from effect using observations of a pair of random variables is a core problem in causal discovery. Most approaches proposed for this task, notably additive noise models (ANM), are only adequate for quantitative data. We propose a criterion to address the cause-effect problem with categorical variables (taking values in sets with no meaningful order), inspired by viewing a conditional probability mass function (pmf) as a discrete memoryless channel. We select as the most likely causal direction the one in which the conditional pmf is closer to a uniform channel (UC). The rationale is that, in a UC, as in an ANM, the conditional entropy (of the effect given the cause) is independent of the cause distribution, in agreement with the principle of independence of cause and mechanism. Our approach, which we call the uniform channel model (UCM), thus extends the ANM rationale to categorical variables. To assess how close a conditional pmf (estimated from data) is to a UC, we use statistical testing, supported by a closed-form estimate of a UC. On the theoretical front, we prove identifiability of the UCM and show its equivalence with a structural causal model with a low-cardinality exogenous variable. Finally, the proposed method compares favorably with recent state-of-the-art alternatives in experiments on synthetic, benchmark, and real data.
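The entropy argument can be checked numerically with a minimal sketch. It assumes that a UC is a channel whose rows (the conditional pmfs of the effect given each value of the cause) are permutations of one another; the abstract does not state the definition, so this is an assumption. The matrices `uniform_channel` and `generic_channel` and the helper names are hypothetical and serve only as illustration. The sketch shows the entropy property alone, not the statistical test or the closed-form UC estimate mentioned in the abstract.

```python
import math
import random


def entropy(p):
    """Shannon entropy (in bits) of a pmf given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def conditional_entropy(px, channel):
    """H(Y|X) = sum_x p(x) * H(Y | X = x); channel[x] is the pmf of Y given X = x."""
    return sum(p * entropy(row) for p, row in zip(px, channel))


def random_pmf(k, rng):
    """Draw an arbitrary pmf over k categories by normalising random weights."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


# Assumed UC (hypothetical values): every row is a permutation of the same base pmf.
base = [0.7, 0.2, 0.1]
uniform_channel = [
    [0.7, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
]

# A generic channel (hypothetical values) whose rows are not permutations of one another.
generic_channel = [
    [0.9, 0.05, 0.05],
    [0.4, 0.3, 0.3],
    [0.2, 0.1, 0.7],
]

rng = random.Random(0)
cause_pmfs = [random_pmf(3, rng) for _ in range(5)]

uc_values = [conditional_entropy(px, uniform_channel) for px in cause_pmfs]
generic_values = [conditional_entropy(px, generic_channel) for px in cause_pmfs]

# Spread of H(Y|X) across the different cause distributions.
uc_spread = max(uc_values) - min(uc_values)
generic_spread = max(generic_values) - min(generic_values)
print(f"UC spread: {uc_spread:.2e}, generic spread: {generic_spread:.2e}")
```

Under this assumed definition, every row of the UC has the same entropy as `base`, so H(Y|X) is a weighted average of identical values and does not change with the cause pmf. For the generic channel, the row entropies differ, so H(Y|X) varies with the cause distribution.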