Causal representation learning seeks to uncover causal relationships among high-level latent variables from low-level, entangled, and noisy observations. Existing approaches often either rely on deep neural networks, which lack interpretability and formal guarantees, or impose restrictive assumptions like linearity, continuous-only observations, and strong structural priors. These limitations particularly challenge applications with a large number of discrete latent variables and mixed-type observations. To address these challenges, we propose discrete causal representation learning (DCRL), a generative framework that models a directed acyclic graph among discrete latent variables, along with a sparse bipartite graph linking latent and observed layers. This design accommodates continuous, count, and binary responses through flexible measurement models while maintaining interpretability. Under mild conditions, we prove that both the bipartite measurement graph and the latent causal graph are identifiable from the observed data distribution alone. We further propose a three-stage estimate-resample-discovery pipeline: penalized estimation of the generative model parameters, resampling of latent configurations from the fitted model, and score-based causal discovery on the resampled latents. We establish the consistency of this procedure, ensuring reliable recovery of the latent causal structure. Empirical studies on educational assessment and synthetic image data demonstrate that DCRL recovers sparse and interpretable latent causal structures.
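As a rough illustration of the generative structure described above, the following minimal sketch samples binary latent variables from a small latent DAG and then generates continuous, count, and binary observations through a sparse bipartite graph. The abstract does not give parametric forms, so everything here is an assumption made for illustration: the function name `sample_dcrl_toy`, the matrices `A`, `Q`, and `B`, the logistic conditionals for the latents, the identity/log/logit links for the three response types, and all numeric values. It is not the authors' implementation and covers only data generation, not the estimate-resample-discovery pipeline.

```python
import numpy as np

def sample_dcrl_toy(n, seed=0):
    """Toy sketch: binary latent DAG -> sparse bipartite graph -> mixed-type observations."""
    rng = np.random.default_rng(seed)

    # Hypothetical latent DAG over K = 3 binary latents: Z0 -> Z1 -> Z2.
    # A[k, l] != 0 means Z_k is a parent of Z_l; strictly upper triangular, hence acyclic.
    A = np.array([[0.0, 2.0, 0.0],
                  [0.0, 0.0, 2.0],
                  [0.0, 0.0, 0.0]])
    intercept = np.array([0.0, -1.0, -1.0])

    K = A.shape[0]
    Z = np.zeros((n, K))
    for k in range(K):  # ancestral sampling in topological order
        logit = intercept[k] + Z @ A[:, k]
        Z[:, k] = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

    # Hypothetical sparse bipartite graph: Q[j, k] = 1 if observed item j depends on latent k.
    Q = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1]])
    B = 1.5 * Q               # slopes, nonzero only on the edges of Q
    eta = -0.5 + Z @ B.T      # linear predictor, shape (n, J)

    # Assumed measurement models: items 0-1 Gaussian, items 2-3 Poisson, items 4-5 Bernoulli.
    X = np.empty_like(eta)
    X[:, 0:2] = rng.normal(loc=eta[:, 0:2], scale=1.0)
    X[:, 2:4] = rng.poisson(lam=np.exp(eta[:, 2:4]))
    X[:, 4:6] = rng.random((n, 2)) < 1.0 / (1.0 + np.exp(-eta[:, 4:6]))
    return Z, X, A, Q

Z, X, A, Q = sample_dcrl_toy(n=500)
```

In this toy setup only `X` would be available to a learner; `Z`, `A`, and `Q` are returned so that a recovered measurement graph or latent graph could be compared against the ground truth.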