Tightening Bounds on Probabilities of Causation By Merging Datasets

Probabilities of Causation (PoC) play a fundamental role in decision-making in law, health care, and public policy. Point identification of the PoC is challenging, however: it requires strong assumptions, and without them only bounds can be derived. Existing work that tightens these bounds by leveraging extra information either provides numerical bounds, provides symbolic bounds only for fixed dimensionality, or requires access to multiple datasets that contain the same treatment and outcome variables. Yet in many clinical, epidemiological, and public policy applications, there exist external datasets that examine the effect of different treatments on the same outcome variable, or that study the association between covariates and the outcome variable. Such external datasets cannot be used with the aforementioned bounds, since they may entail different treatment assignment mechanisms or even obey different causal structures. Here, we provide symbolic bounds on the PoC for this challenging scenario. We focus on combining either two randomized experiments that study different treatments, or a randomized experiment and an observational study, assuming causal sufficiency. Our symbolic bounds hold for arbitrary dimensionality of covariates and treatment, and we discuss the conditions under which they are tighter than existing bounds in the literature. Finally, our bounds parameterize the difference in treatment assignment mechanisms across datasets, so the mechanisms may vary while causal information can still be transferred from the external dataset to the target dataset.
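
To make concrete what "only bounds can be derived" means, the sketch below computes the classical Tian-Pearl bounds on the probability of necessity and sufficiency (PNS) for a binary treatment and a binary outcome. These are the standard bounds from the earlier literature, not the new symbolic bounds of this paper; the function names, argument names, and example numbers are assumptions chosen for illustration.

```python
# Classical Tian-Pearl bounds on PNS for binary treatment X and outcome Y.
# NOTE: illustrative sketch of the well-known earlier bounds, NOT the
# bounds proposed in this paper. All names and numbers are hypothetical.


def pns_bounds_experimental(p_y_do_x, p_y_do_not_x):
    """Bounds on PNS from experimental data only.

    p_y_do_x     : P(Y=1 | do(X=1))
    p_y_do_not_x : P(Y=1 | do(X=0))
    """
    lower = max(0.0, p_y_do_x - p_y_do_not_x)
    upper = min(p_y_do_x, 1.0 - p_y_do_not_x)
    return lower, upper


def pns_bounds_combined(p_y_do_x, p_y_do_not_x, p_xy, p_xy0, p_x0y, p_x0y0):
    """Bounds on PNS from experimental plus observational data.

    p_xy   : P(X=1, Y=1)    p_xy0  : P(X=1, Y=0)
    p_x0y  : P(X=0, Y=1)    p_x0y0 : P(X=0, Y=0)
    """
    total = p_xy + p_xy0 + p_x0y + p_x0y0
    if abs(total - 1.0) > 1e-9:
        raise ValueError("observational joint probabilities must sum to 1")

    p_y = p_xy + p_x0y  # marginal P(Y=1) in the observational data
    lower = max(
        0.0,
        p_y_do_x - p_y_do_not_x,
        p_y - p_y_do_not_x,
        p_y_do_x - p_y,
    )
    upper = min(
        p_y_do_x,
        1.0 - p_y_do_not_x,
        p_xy + p_x0y0,
        p_y_do_x - p_y_do_not_x + p_xy0 + p_x0y,
    )
    return lower, upper


if __name__ == "__main__":
    # Hypothetical inputs: the experiment alone, then with observational data.
    print(pns_bounds_experimental(0.7, 0.4))
    print(pns_bounds_combined(0.7, 0.4, 0.2, 0.3, 0.3, 0.2))
```

With these hypothetical inputs, adding the observational joint distribution narrows the upper bound, which is the kind of tightening from extra data that the abstract refers to. The classical combined bounds assume both datasets share the same treatment and outcome variables, which is the restriction this paper sets out to relax.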

