Bayesian Improved Surname Geocoding (BISG) is a ubiquitous tool for predicting race and ethnicity from an individual's geolocation and surname. BISG assumes that, in the United States population, surname and geolocation are independent conditional on race or ethnicity. This assumption appears to contradict conventional wisdom, for example that people often live near relatives who share their surname and race. We demonstrate that the independence assumption produces systematic biases for minority subpopulations, and we introduce a simple alternative to BISG. Our raking-based prediction algorithm offers a significant improvement over BISG, and we validate it on states' voter registration lists that contain self-identified race/ethnicity. Both the inaccuracies of BISG and the proposed improvement generalize to applications in election law, health care, finance, tech, law enforcement, and many other fields.
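To make the independence assumption concrete, the following is a minimal sketch, not the authors' implementation. Under the assumption that surname and geolocation are independent given race, the BISG posterior factorizes as P(race | surname, location) ∝ P(race | surname) · P(location | race). The second function shows generic raking (iterative proportional fitting) of a two-way table to fixed margins; how the paper applies raking to the prediction problem is not specified in this abstract, so that function only illustrates the general technique. The function names `bisg_posterior` and `rake`, and all numbers in the usage lines, are hypothetical.

```python
import numpy as np


def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Posterior over race categories under the BISG independence assumption.

    Both inputs are 1-D arrays indexed by race category (hypothetical values).
    """
    prior = np.asarray(p_race_given_surname, dtype=float)
    likelihood = np.asarray(p_geo_given_race, dtype=float)
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()


def rake(table, row_targets, col_targets, max_iter=100, tol=1e-10):
    """Generic raking: rescale a positive 2-way table until its row and
    column sums match the target margins (which must share the same total)."""
    fitted = np.asarray(table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Match row sums, then column sums.
        fitted *= (row_targets / fitted.sum(axis=1))[:, None]
        fitted *= (col_targets / fitted.sum(axis=0))[None, :]
        # Column sums are exact after the last step; check the rows.
        if np.allclose(fitted.sum(axis=1), row_targets, atol=tol):
            break
    return fitted


# Hypothetical inputs for three race categories.
posterior = bisg_posterior([0.6, 0.3, 0.1], [0.001, 0.004, 0.002])
print(posterior)  # approximately [0.3, 0.6, 0.1]

# Hypothetical 2x2 table raked to row sums (30, 70) and column sums (40, 60).
print(rake(np.ones((2, 2)), [30, 70], [40, 60]))
```

In the usage lines, the location term shifts the prediction away from the surname-only probabilities. If surname and location are in fact correlated within a race category, this product misstates the joint evidence, which is the concern the abstract raises.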