Federated Causal Discovery Across Heterogeneous Datasets under Latent Confounding

Causal discovery across multiple datasets is often constrained by data privacy regulations and cross-site heterogeneity, limiting the use of conventional methods that require a single, centralized dataset. To address these challenges, we introduce fedCI, a federated conditional independence test that rigorously handles heterogeneous datasets with non-identical sets of variables, site-specific effects, and mixed variable types, including continuous, ordinal, binary, and categorical variables. At its core, fedCI uses a federated Iteratively Reweighted Least Squares (IRLS) procedure to estimate the parameters of the generalized linear models underlying likelihood-ratio tests for conditional independence. Building on this, we develop fedCI-IOD, a federated extension of the Integration of Overlapping Datasets (IOD) algorithm that replaces IOD's meta-analysis strategy and enables, for the first time, federated causal discovery under latent confounding across distributed and heterogeneous datasets. By aggregating evidence federatively, fedCI-IOD not only preserves privacy but also substantially enhances statistical power, achieving performance comparable to fully pooled analyses and mitigating artifacts caused by small local sample sizes. Our tools are publicly available as the fedCI Python package, a privacy-preserving R implementation of IOD, and a web application for the fedCI-IOD pipeline, providing versatile, user-friendly solutions for federated conditional independence testing and causal discovery.
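The idea of a federated IRLS step feeding a likelihood-ratio test can be illustrated with a minimal sketch. It is not the fedCI package's API: the function names (`local_stats`, `federated_irls`, `lr_test`) and the simulated data are hypothetical. It also assumes the simplest setting only: a binary outcome with a logistic model, identical variables at every site, coefficients shared across sites, and no site-specific effects. Each site shares only the aggregates X^T W X, X^T W z, and its log-likelihood, never individual rows.

```python
import math
import numpy as np

def local_stats(X, y, beta):
    """Run at one site (hypothetical helper): returns aggregates only."""
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))
    w = np.clip(mu * (1.0 - mu), 1e-10, None)   # IRLS weights
    z = eta + (y - mu) / w                      # working response
    xtwx = X.T @ (w[:, None] * X)
    xtwz = X.T @ (w * z)
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return xtwx, xtwz, loglik

def federated_irls(sites, n_iter=25, tol=1e-8):
    """Server loop: sum the per-site aggregates, solve weighted least squares."""
    beta = np.zeros(sites[0][0].shape[1])
    for _ in range(n_iter):
        stats = [local_stats(X, y, beta) for X, y in sites]
        xtwx = sum(s[0] for s in stats)
        xtwz = sum(s[1] for s in stats)
        new_beta = np.linalg.solve(xtwx, xtwz)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    loglik = sum(local_stats(X, y, beta)[2] for X, y in sites)
    return beta, loglik

def lr_test(sites, test_col):
    """Likelihood-ratio test: is y independent of column test_col given the rest?"""
    _, ll_full = federated_irls(sites)
    reduced = [(np.delete(X, test_col, axis=1), y) for X, y in sites]
    _, ll_reduced = federated_irls(reduced)
    stat = max(2.0 * (ll_full - ll_reduced), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival, 1 d.o.f.
    return stat, p_value

# Simulated data (an assumption for illustration): y depends on z but not on x.
rng = np.random.default_rng(0)

def make_site(n):
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(0.3 + 1.5 * z)))
    y = (rng.random(n) < p).astype(float)
    return np.column_stack([np.ones(n), x, z]), y   # columns: intercept, x, z

sites = [make_site(400), make_site(300)]
beta_fed, _ = federated_irls(sites)

# Pooled reference fit: the same routine applied to one concatenated "site".
X_pool = np.vstack([X for X, _ in sites])
y_pool = np.concatenate([y for _, y in sites])
beta_pool, _ = federated_irls([(X_pool, y_pool)])

stat_z, p_z = lr_test(sites, test_col=2)   # tests y vs z given x
stat_x, p_x = lr_test(sites, test_col=1)   # tests y vs x given z
print("federated:", beta_fed, "pooled:", beta_pool)
print("p-value for z:", p_z, "p-value for x:", p_x)
```

Because the summed per-site aggregates equal the aggregates of the concatenated data, the federated estimates in this sketch match the pooled fit up to floating-point error, which is consistent with the abstract's claim of performance comparable to fully pooled analyses. Handling non-identical variable sets, site-specific effects, and ordinal or categorical outcomes would require additional machinery not shown here.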

