Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$, given a third (lurking) random variable $B$. We focus on scenarios in which the random variables $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. For such cases, the correct association between $a_1$ and $a_2$ is to be defined via conditioning over $C$. This set-up generalizes the original Simpson's paradox: its two contradicting options now simply refer to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and the most widespread situation for a valid Simpson's paradox), then conditioning over any binary common cause $C$ establishes the same direction of the association between $a_1$ and $a_2$ as conditioning over $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that assumes conditioning over $B$ rather than marginalizing over it. For ternary (unobserved) common causes $C$, all three options of Simpson's paradox become possible (i.e. marginalized, conditional, and none of them), and one needs prior information on $C$ to choose the right option.
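As a minimal numerical sketch of the original (binary $B$) formulation, the snippet below checks that the marginal association between $a_1$ and $a_2$ can have the opposite sign to the association conditioned on each value of $B$. The counts are the widely quoted kidney-stone treatment figures and are used here purely as an illustrative assumption; they do not come from this work. The mapping ($a_1$ = recovery, $a_2$ = first treatment, $B$ = stone size) and the helper name `p_a1_given` are likewise hypothetical.

```python
from fractions import Fraction

# Illustrative counts (an assumption, not data from this work).
# Key: (value of B, whether a2 holds); value: (cases where a1 holds, total cases).
counts = {
    ("small", True): (81, 87),
    ("small", False): (234, 270),
    ("large", True): (192, 263),
    ("large", False): (55, 80),
}


def p_a1_given(a2, b=None):
    """Return p(a1 | a2) if b is None, else p(a1 | a2, B=b), as an exact fraction."""
    rows = [
        v for (b_val, a2_val), v in counts.items()
        if a2_val == a2 and (b is None or b_val == b)
    ]
    hits = sum(h for h, _ in rows)
    total = sum(t for _, t in rows)
    return Fraction(hits, total)


# Marginalized option: sign of p(a1 | a2) - p(a1 | not a2).
marginal_diff = p_a1_given(True) - p_a1_given(False)

# Conditional option: sign of p(a1 | a2, b) - p(a1 | not a2, b) for each b.
conditional_diffs = {
    b: p_a1_given(True, b) - p_a1_given(False, b) for b in ("small", "large")
}

print("marginal:", float(marginal_diff))
for b, d in conditional_diffs.items():
    print(f"given B={b}:", float(d))
```

With these counts the marginal difference is negative while both conditional differences are positive, which is exactly the sign reversal between the two options of the paradox. The sketch only exhibits the reversal; it does not decide which option is correct, since that depends on the assumed common cause $C$.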