On Data Thinning for Model Validation in Small Area Estimation

Small area estimation (SAE) produces estimates of population parameters for geographic and demographic subgroups with limited sample sizes. Such estimates are critical for informing policy decisions, ranging from poverty mapping to social program funding. Despite its widespread use, principled validation of SAE models remains challenging, and general guidelines are far from well established. Unlike conventional predictive modeling settings, validation data are rarely available in the SAE context. External validation surveys or censuses often do not exist, and access to individual-level microdata is often restricted, making standard cross-validation infeasible. In this paper, we propose a novel model validation scheme that uses only area-level direct survey estimates under the widely used Fay–Herriot model. Our approach is based on data thinning, which splits area-level observations into independent training and test components to enable out-of-sample validation. Our theoretical analysis reveals a fundamental tension inherent in thinning-based validation: a performance metric computed on the thinned training component targets a different quantity than the same metric computed on the full data, and the gap varies with model complexity. Increasing the information allocated to training reduces this gap but inflates the variance of the estimator. We formally characterize this bias-variance tradeoff and provide practical recommendations for thinning parameters that balance these competing considerations for model comparison. In design-based simulations using American Community Survey microdata, we show that data thinning with these settings provides consistent and stable performance across heterogeneous sampling designs.
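
To make the splitting step concrete, the following is a minimal sketch of Gaussian data thinning applied to area-level direct estimates. It assumes the usual Fay–Herriot sampling model, in which each direct estimate is normally distributed around the area mean with a known sampling variance. The function name `thin_gaussian`, the arguments `y`, `D`, and `eps`, and the rescaling of both parts to the original scale are illustrative choices, not notation taken from the paper.

```python
import numpy as np


def thin_gaussian(y, D, eps, rng):
    """Split direct estimates y_i ~ N(theta_i, D_i) into two independent parts.

    Assumes the sampling variances D_i are known. eps in (0, 1) is the
    share of information allocated to the training part (hypothetical name).
    Returns (y_train, D_train, y_test, D_test), each on the scale of theta.
    """
    y = np.asarray(y, dtype=float)
    D = np.asarray(D, dtype=float)
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")

    # Auxiliary noise with variance eps * (1 - eps) * D_i makes the two
    # pieces uncorrelated, hence independent under normality.
    noise = rng.normal(size=y.shape) * np.sqrt(eps * (1.0 - eps) * D)
    part_train = eps * y + noise   # ~ N(eps * theta, eps * D)
    part_test = y - part_train     # ~ N((1 - eps) * theta, (1 - eps) * D)

    # Rescale so that both parts are unbiased for theta.
    y_train = part_train / eps
    y_test = part_test / (1.0 - eps)
    return y_train, D / eps, y_test, D / (1.0 - eps)


if __name__ == "__main__":
    rng = np.random.default_rng(2024)
    theta = np.array([10.0, 12.0, 15.0])   # made-up area means
    D = np.array([1.0, 4.0, 0.25])         # made-up sampling variances
    y = theta + rng.normal(size=3) * np.sqrt(D)
    y_train, D_train, y_test, D_test = thin_gaussian(y, D, eps=0.8, rng=rng)
    print("train:", y_train, "variances:", D_train)
    print("test: ", y_test, "variances:", D_test)
```

A candidate model would be fit to `y_train` with variances `D_train` and scored against `y_test`. A larger `eps` leaves the training part closer to the full data but makes the test part noisier, since its variance `D / (1 - eps)` grows, which is the tradeoff the abstract describes.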
