For decades, forensic statisticians have debated whether searching large DNA databases undermines the evidential value of a match. Modern surveillance faces an exponentially harder problem: screening populations across thousands of attributes using threshold rules rather than exact matching. Intuition suggests that requiring many coincidental matches should make false alerts astronomically unlikely. This intuition fails. Consider a system that monitors 1,000 attributes, each with a 0.5 percent innocent match rate. Matching 15 pre-specified attributes has probability about \(3 \times 10^{-35}\), one in 30 decillion, effectively impossible. But operational systems require no such specificity. They might flag anyone who matches \emph{any} 15 of the 1,000. In a city of one million innocent people, this produces about 226 false alerts. A seemingly impossible event becomes all but guaranteed. This is not an implementation flaw but a mathematical consequence of high-dimensional screening. We identify fundamental probabilistic limits on screening reliability. Systems undergo sharp transitions from reliable to unreliable with small increases in data scale, a fragility worsened by data growth and correlations. As data accumulate and correlation collapses effective dimensionality, systems enter regimes where alerts lose evidential value even when individual coincidences remain vanishingly rare. This framework reframes the DNA database controversy as a shift between operational regimes. Unequal surveillance exposures magnify failure, making ``structural bias'' mathematically inevitable. These limits are structural: beyond a critical scale, failure cannot be prevented through threshold adjustment or algorithmic refinement.
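The two figures in the example can be reproduced with a short sketch. One modeling assumption is made here: the number of attributes an innocent person matches, nominally Binomial(1000, 0.005), is approximated by a Poisson distribution with mean \(\lambda = 1000 \times 0.005 = 5\), which is what yields the quoted 226 (the exact binomial tail is slightly lighter, around 215).

```python
from math import exp, factorial

n_attrs = 1000          # attributes monitored per person
p = 0.005               # innocent per-attribute match rate (0.5 percent)
k = 15                  # alert threshold: 15 matching attributes
population = 1_000_000  # innocent people screened

# Matching 15 *pre-specified* attributes: independent coincidences multiply.
p_specified = p ** k    # ~3e-35, "one in 30 decillion"

# Matching *any* 15 of the 1000: the match count is approximately
# Poisson with mean lambda = n * p = 5 (assumption noted above).
lam = n_attrs * p
p_any = 1.0 - sum(exp(-lam) * lam**j / factorial(j) for j in range(k))

expected_false_alerts = population * p_any  # ~226 in a city of one million
```

The contrast between `p_specified` and `expected_false_alerts` is the entire phenomenon: the per-pattern probability stays astronomically small while the union over \(\binom{1000}{15}\) possible patterns makes false alerts routine.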