This paper proposes small area estimation methods that use generalized tree-based machine learning techniques to improve the estimation of disaggregated means in small areas from discrete survey data. Specifically, we present two approaches based on random forests, the Generalized Mixed Effects Random Forest (GMERF) and the Mixed Effects Random Forest (MERF), both tailored to the challenges of count outcomes, particularly overdispersion. Our analysis shows that the MERF, which does not assume a Poisson distribution when modeling the mean of count data, performs best under severe overdispersion. Conversely, the GMERF performs best when the Poisson assumptions are at least moderately met. Additionally, we introduce and evaluate three bootstrap methods (one parametric and two non-parametric) for assessing the reliability of point estimators of area-level means. We test these methods in model-based and design-based simulations and apply them to a real-world dataset from the state of Guerrero in Mexico, demonstrating their robustness and potential for practical applications.
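To make the notion of overdispersion concrete, the following minimal sketch compares the variance-to-mean ratio of simulated Poisson counts with that of gamma-Poisson (negative binomial) counts sharing the same mean. It is an illustration only and is not part of the GMERF or MERF procedures; the function names (`poisson_draw`, `dispersion_index`), the mean of 4, the shape parameter 0.5, and the sample size are all assumptions chosen for the example.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def dispersion_index(counts):
    """Sample variance divided by sample mean (close to 1 for Poisson data)."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
mean = 4.0               # assumed common mean of both count distributions
shape = 0.5              # assumed gamma shape; smaller means more overdispersion
n = 20_000               # assumed sample size

# Equidispersed counts: variance equals the mean.
poisson_counts = [poisson_draw(rng, mean) for _ in range(n)]

# Overdispersed counts: the Poisson rate is itself gamma distributed with
# mean `mean`, so the marginal variance is mean + mean**2 / shape.
negbin_counts = [
    poisson_draw(rng, rng.gammavariate(shape, mean / shape)) for _ in range(n)
]

poisson_di = dispersion_index(poisson_counts)
negbin_di = dispersion_index(negbin_counts)
print(f"Poisson dispersion index:       {poisson_di:.2f}")
print(f"Gamma-Poisson dispersion index: {negbin_di:.2f}")
```

Under these assumed settings the theoretical dispersion index is 1 for the Poisson sample and 1 + mean / shape = 9 for the gamma-Poisson sample. An index well above 1 is the kind of departure from the Poisson assumption that the abstract refers to as overdispersion.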