This work is motivated by the following problem: Can we identify the disease-causing gene in a patient affected by a monogenic disorder? This problem is an instance of root cause discovery. In particular, we aim to identify the intervened variable in one interventional sample using a set of observational samples as reference. We consider a linear structural equation model where the causal ordering is unknown. We begin by examining a simple method that uses squared z-scores and characterize the conditions under which this method succeeds and fails, showing that it generally cannot identify the root cause. We then prove, without additional assumptions, that the root cause is identifiable even if the causal ordering is not. Two key ingredients of this identifiability result are the use of permutations and the Cholesky decomposition, which allow us to exploit an invariant property across different permutations to discover the root cause. Furthermore, we characterize permutations that yield the correct root cause and, based on this, propose a valid method for root cause discovery. We also adapt this approach to high-dimensional settings. Finally, we evaluate the performance of our methods through simulations and apply the high-dimensional method to discover disease-causing genes in the gene expression dataset that motivates this work.
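To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's procedure (which the abstract does not spell out). It assumes a hypothetical three-variable linear structural equation model, and all names (`squared_z_scores`, `cholesky_residuals`, `B`, `noise_sd`, `delta`) are illustrative. It computes the squared z-scores of one interventional sample against observational samples, then computes whitened residuals from a Cholesky factor of the observational covariance under an ordering that is assumed to be compatible with the causal order. The abstract's actual contribution, finding the root cause when that ordering is unknown, is not implemented here.

```python
import numpy as np


def squared_z_scores(X_obs, x_int):
    """Squared z-score of each variable of x_int relative to observational samples."""
    mean = X_obs.mean(axis=0)
    std = X_obs.std(axis=0, ddof=1)  # per-variable sample standard deviation
    return ((x_int - mean) / std) ** 2


def cholesky_residuals(X_obs, x_int, order):
    """Squared whitened residuals of x_int under a given variable ordering.

    Assumption for illustration: `order` lists variable indices with causes
    before effects. The abstract's setting treats this ordering as unknown.
    """
    order = np.asarray(order)
    mean = X_obs.mean(axis=0)[order]
    cov = np.cov(X_obs[:, order], rowvar=False)
    L = np.linalg.cholesky(cov)  # lower triangular, cov = L @ L.T
    resid = np.linalg.solve(L, x_int[order] - mean)
    out = np.empty_like(resid)
    out[order] = resid ** 2  # map back to the original variable indices
    return out


# Hypothetical SEM X = B X + eps: X0 -> X1, X0 -> X2 (negative), X1 -> X2.
B = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [-1.0, 1.0, 0.0]])
noise_sd = np.array([10.0, 1.0, 1.0])
A = np.linalg.inv(np.eye(3) - B)  # X = A @ eps

rng = np.random.default_rng(0)
eps = rng.normal(size=(2000, 3)) * noise_sd
X_obs = eps @ A.T  # observational samples, one per row

# One interventional sample: a mean shift `delta` on variable 1 (the root cause).
# Its noise is set to zero here only to keep the illustration easy to follow.
delta = 5.0
x_int = A @ (delta * np.array([0.0, 1.0, 0.0]))

z2 = squared_z_scores(X_obs, x_int)
r2 = cholesky_residuals(X_obs, x_int, order=[0, 1, 2])
print("squared z-scores:      ", z2)
print("squared Cholesky resid:", r2)
```

In this construction the population squared z-scores are 0, 25/101, and 25/2, so the largest z-score sits on variable 2, a descendant of the intervened variable, because variable 1 inherits the large variance of its parent. Under the compatible ordering, the Cholesky factor equals `A` times the diagonal matrix of noise standard deviations, so the whitened residual is nonzero only at the intervened variable, up to estimation error.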