Barlow (1985) hypothesized that the co-occurrence of two events $A$ and $B$ is "suspicious" if $P(A,B) \gg P(A) P(B)$. We first review classical measures of association for $2 \times 2$ contingency tables, including Yule's $Y$ (Yule, 1912), which depends only on the odds ratio $\lambda$ and is independent of the marginal probabilities of the table. We then discuss the mutual information (MI) and pointwise mutual information (PMI), which depend on the ratio $P(A,B)/(P(A) P(B))$, as measures of association. We show that, once the effect of the marginals is removed, MI and PMI behave similarly to $Y$ as functions of $\lambda$. The PMI is used extensively in some research communities for flagging suspicious coincidences, but it is sensitive to the marginals: for a given strength of association, sparser events receive higher scores.
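As a rough illustration of the contrast drawn above, the following sketch computes the odds ratio $\lambda$, Yule's $Y = (\sqrt{\lambda} - 1)/(\sqrt{\lambda} + 1)$, and the PMI $\log\big(P(A,B)/(P(A)P(B))\big)$ for two $2 \times 2$ tables that share the same odds ratio but have different marginals. The function names, the example tables, and the choice of the natural logarithm are assumptions made for illustration only; they are not taken from the paper.

```python
import math

# A table is ((p11, p10), (p01, p00)), where p11 = P(A, B),
# p10 = P(A, not B), p01 = P(not A, B), p00 = P(not A, not B).

def odds_ratio(table):
    (p11, p10), (p01, p00) = table
    return (p11 * p00) / (p10 * p01)

def yules_y(table):
    # Yule's Y depends on the table only through the odds ratio.
    root = math.sqrt(odds_ratio(table))
    return (root - 1.0) / (root + 1.0)

def pmi(table):
    # Pointwise mutual information of the (A, B) cell, natural log (assumed).
    (p11, p10), (p01, p00) = table
    p_a = p11 + p10
    p_b = p11 + p01
    return math.log(p11 / (p_a * p_b))

def rescale(table, row_factor, col_factor):
    # Multiply the A row and the B column by constants, then renormalise.
    # This changes the marginals but leaves the odds ratio unchanged.
    (p11, p10), (p01, p00) = table
    cells = (p11 * row_factor * col_factor, p10 * row_factor,
             p01 * col_factor, p00)
    total = sum(cells)
    q11, q10, q01, q00 = (c / total for c in cells)
    return ((q11, q10), (q01, q00))

# Illustrative tables (hypothetical values): same odds ratio, different marginals.
dense = ((0.4, 0.1), (0.1, 0.4))
sparse = rescale(dense, 0.1, 0.1)  # A and B both become rare

for name, table in (("dense", dense), ("sparse", sparse)):
    print(name, odds_ratio(table), yules_y(table), pmi(table))
```

In this sketch the two tables have the same $\lambda$ and therefore the same $Y$, while the PMI of the sparse table is larger, which is the marginal sensitivity noted above.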