Different statistical samples (e.g., from different locations) offer populations and learning systems observations with distinct statistical properties. Samples under (1) 'Unconfounded' growth preserve systems' ability to determine the independent effects of their individual variables on any outcome of interest (and therefore lead to fair and interpretable black-box predictions). Samples under (2) 'Externally-Valid' growth preserve their ability to make predictions that generalize across out-of-sample variation. The first promotes predictions that generalize over populations, the second over their shared exogenous factors. We illustrate these theoretical patterns in the full American census from 1840 to 1940, with samples ranging from the street level to the national level. This reveals sample requirements for generalizability over space, and new connections among the Shapley value, U-Statistics (Unbiased Statistics), and Hyperbolic Geometry.