Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$, given a third (lurking) random variable $B$. We focus on scenarios in which the random variable $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. In such cases, the correct association between $a_1$ and $a_2$ is to be defined by conditioning on $C$. This setup generalizes the original Simpson's paradox: its two contradictory options now correspond to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and most widespread situation for Simpson's paradox), conditioning on any binary common cause $C$ establishes the same direction of association between $a_1$ and $a_2$ as conditioning on $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that conditions on $B$ rather than marginalizing over it. The same conclusion is reached when Simpson's paradox is formulated via three continuous Gaussian variables: within the minimal formulation of the paradox (three scalar continuous variables $A_1$, $A_2$, and $B$), one should choose the option that conditions on $B$.
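The two contradicting options of the paradox can be made concrete with a small numerical sketch. The counts below are hypothetical and chosen only so that the reversal occurs; they are not taken from the paper. The helper names `p_a1_given` and `association` are likewise illustrative. The sketch compares the sign of $P(a_1|a_2) - P(a_1|\bar{a}_2)$ after marginalizing over a binary $B$ with the sign of the same difference conditioned on each value of $B$.

```python
from fractions import Fraction

# Hypothetical counts (illustration only, not data from the paper).
# counts[(b, a2)] = (number of cases where a1 occurred, total cases)
counts = {
    (0, True): (80, 100),
    (0, False): (9, 10),
    (1, True): (3, 10),
    (1, False): (40, 100),
}


def p_a1_given(a2, b=None):
    """Return P(a1 | a2, b) from the counts; b=None marginalizes over B."""
    cells = [
        v for (bb, aa), v in counts.items()
        if aa == a2 and (b is None or bb == b)
    ]
    hits = sum(h for h, _ in cells)
    total = sum(n for _, n in cells)
    return Fraction(hits, total)


def association(b=None):
    """Sign (+1, -1 or 0) of P(a1 | a2, .) - P(a1 | not a2, .)."""
    diff = p_a1_given(True, b) - p_a1_given(False, b)
    return (diff > 0) - (diff < 0)


marginal_sign = association()                           # B marginalized
conditional_signs = [association(b) for b in (0, 1)]    # B = 0, B = 1

print("marginal:", marginal_sign)
print("conditioned on B:", conditional_signs)
```

With these counts the marginal association is positive while the association is negative for both values of $B$, which is the reversal the paradox refers to. The sketch only exhibits the two options; it does not implement the common-cause analysis over $C$ that is used to choose between them.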