Causal inference is fundamental to empirical scientific discovery in the natural and social sciences; however, data management problems that arise while conducting causal inference can lead to false discoveries. Two such problems are (i) not having all the attributes required for the analysis, and (ii) misidentifying which attributes should be included in the analysis. Analysts often have access to only partial data, and they rely critically on domain knowledge, typically given in the form of a causal directed acyclic graph (DAG) and itself often unavailable or incomplete, to identify the attributes to include. We argue that data management techniques can surmount both of these challenges. In this work, we introduce the Causal Data Integration (CDI) problem, in which unobserved attributes are mined from external sources and a corresponding causal DAG is automatically built. We identify key challenges and research opportunities in designing a CDI system, and present a system architecture for solving the CDI problem. Our preliminary experimental results demonstrate that solving CDI is achievable and pave the way for future research.