Density Ratio Estimation (DRE) is an important machine learning technique with many downstream applications. We consider the challenge of DRE with missing not at random (MNAR) data. In this setting, we show that using standard DRE methods leads to biased results, while our proposal (M-KLIEP), an adaptation of the popular DRE procedure KLIEP, restores consistency. Moreover, we provide finite sample estimation error bounds for M-KLIEP, which demonstrate minimax optimality with respect to both sample size and worst-case missingness. We then adapt an important downstream application of DRE, Neyman-Pearson (NP) classification, to this MNAR setting. Our procedure both controls Type I error and achieves high power, with high probability. Finally, we demonstrate promising empirical performance on both synthetic data and real-world data with simulated missingness.