Population size estimation from capture-recapture data is central to studying hard-to-reach populations, and auxiliary covariates are commonly incorporated to account for heterogeneous capture probabilities and recapture dependencies. However, these attributes are often missing because of reluctance to share sensitive information, data collection limitations, and imperfect record linkage, which poses a critical methodological challenge. Existing approaches either ignore missingness or rely on a priori imputation, potentially introducing substantial bias. In this work, we develop a novel nonparametric estimation framework that uses a missing-at-random (MAR) assumption to identify capture probabilities under missing covariates. Using semiparametric efficiency theory, we construct one-step estimators that combine efficiency, robustness, and finite-sample validity: they approximately achieve the nonparametric efficiency bound, accommodate flexible machine learning (ML) methods through a doubly robust structure, and provide approximately valid inference for any sample size. Simulations demonstrate substantial improvements over naive imputation approaches, with our doubly robust ML estimators maintaining valid inference even at high missingness rates where competing methods fail. We apply our methodology to re-estimate mortality in the Gaza Strip from October 7, 2023, to June 30, 2024, using three-list capture-recapture data with missing demographic information. Our approach yields more conservative yet precise estimates than previous methods, indicating that the true death toll exceeds official statistics by approximately 26%. Our framework provides practitioners with principled tools for handling incomplete data in conflict settings and other applications with hard-to-reach populations.
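To illustrate why covariates matter for heterogeneous capture probabilities, the following minimal sketch uses the classical two-list Lincoln-Petersen estimator, not the three-list one-step estimators developed in this work. The stratum names and counts are hypothetical expected values, assuming two strata of 1,000 individuals each, with independent per-list capture probabilities of 0.8 in one stratum and 0.2 in the other.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Classical two-list estimate N = n1 * n2 / m.

    n1, n2: number of individuals on list 1 and on list 2
    m:      number of individuals appearing on both lists
    """
    if m <= 0:
        raise ValueError("need at least one individual on both lists")
    return n1 * n2 / m


# Hypothetical expected counts (illustrative only, not data from the paper).
# Each stratum has a true size of 1000; captures are independent across lists.
strata = {
    "high_capture": {"n1": 800, "n2": 800, "m": 640},  # p = 0.8 per list
    "low_capture": {"n1": 200, "n2": 200, "m": 40},    # p = 0.2 per list
}

# Ignoring the covariate: pool all counts, then apply the estimator once.
pooled = lincoln_petersen(
    sum(s["n1"] for s in strata.values()),
    sum(s["n2"] for s in strata.values()),
    sum(s["m"] for s in strata.values()),
)

# Using the covariate: estimate within each stratum, then add up.
stratified = sum(lincoln_petersen(**s) for s in strata.values())

print(f"pooled: {pooled:.0f}, stratified: {stratified:.0f}")
# prints "pooled: 1471, stratified: 2000"
```

Under these assumed counts, the pooled estimate falls well below the true total of 2,000, while the stratified estimate recovers it. When the stratifying attribute is missing for some records, the stratified calculation cannot be applied directly, which is the gap the proposed framework addresses.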