Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy which prioritizes groups who are disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without restrictive assumptions commonly made in positive unlabeled learning and even if it is impossible to recover the absolute prevalence. We conduct experiments on synthetic and real health data which demonstrate our method's ability to recover the relative prevalence more accurately than do baselines, and demonstrate the method's robustness to plausible violations of the covariate shift assumption. We conclude by illustrating the applicability of our method to case studies of intimate partner violence and hate speech.
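The claim that relative prevalence can be recovered when absolute prevalence cannot may be easier to see with a small numeric sketch. The example below is not the estimator from this work. It is a deterministic illustration that relies on one added, hypothetical assumption not stated above: within each group, a person who truly has the condition is diagnosed with a constant, group-specific probability `c[g]`. All numbers and names (`f`, `p_x`, `c`, and the helper functions) are made up for illustration.

```python
# Illustrative numbers only; none of them come from the paper.
# x is a binary symptom indicator; f[x] = p(y=1 | x) is shared by both
# groups, which is the covariate shift assumption described above.
f = {0: 0.1, 1: 0.6}

# Symptom distributions p(x | group) are allowed to differ between groups.
p_x = {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.8, 1: 0.2}}

# Hypothetical assumption: p(diagnosed | y=1, group) is a constant per group.
# These rates are unknown in practice, which is why absolute prevalence
# cannot be recovered from diagnoses alone.
c = {"a": 0.3, "b": 0.6}


def expectation(values, dist):
    """Expected value of values[x] when x follows the distribution dist."""
    return sum(dist[x] * values[x] for x in dist)


def true_relative_prevalence():
    """p(y=1 | group a) / p(y=1 | group b), using the hidden f."""
    return expectation(f, p_x["a"]) / expectation(f, p_x["b"])


def naive_relative_prevalence():
    """Ratio of observed diagnosis rates, biased by unequal underreporting."""
    observed_a = c["a"] * expectation(f, p_x["a"])
    observed_b = c["b"] * expectation(f, p_x["b"])
    return observed_a / observed_b


def recovered_relative_prevalence():
    """Use only observable quantities from group a's diagnosis model.

    q_a[x] = p(diagnosed | x, group a) = c_a * f[x] is estimable from data.
    Averaging q_a over each group's symptom distribution and taking the
    ratio cancels the unknown constant c_a.
    """
    q_a = {x: c["a"] * f[x] for x in f}
    return expectation(q_a, p_x["a"]) / expectation(q_a, p_x["b"])


if __name__ == "__main__":
    print("true     :", true_relative_prevalence())
    print("naive    :", naive_relative_prevalence())
    print("recovered:", recovered_relative_prevalence())
```

With these numbers the true relative prevalence is about 1.75, while the naive ratio of diagnosis rates is about 0.875 and points in the wrong direction. The recovered ratio matches the true one because the unknown diagnosis rate appears in both numerator and denominator and cancels.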