In nature, events that affect some individuals or groups but not others act as implicit interventions and are known as natural experiments. For example, the COVID-19 pandemic can be viewed as an intervention by the coronavirus on the sub-population that was infected. We ask: do natural experiments occur in existing real-world datasets, and if so, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on the recovered causal links. If downstream performance improves when the data are treated as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis on simulated datasets with and without natural experiments, generated from synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and that we can exploit them to improve model performance using causal inference. Our work is an initial foray into this area and offers a preliminary exploration within a limited scope.