Bayesian Improved Surname Geocoding (BISG) is a ubiquitous tool for predicting race and ethnicity using an individual's geolocation and surname. Here we demonstrate that statistical dependence of surname and geolocation within racial/ethnic categories in the United States results in biases for minority subpopulations, and we introduce a raking-based improvement. Our method augments the data used by BISG (distributions of race by geolocation and race by surname) with the distribution of surname by geolocation obtained from state voter files. We validate our algorithm on state voter registration lists that contain self-identified race/ethnicity.
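To make the idea concrete, here is a minimal sketch of how a BISG-style estimate could be raked against a surname-by-geolocation margin. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bisg_joint`, `rake`, `posterior`), the use of plain iterative proportional fitting over three two-way margins, and the toy counts are all hypothetical. The starting table encodes the BISG assumption that surname and geolocation are independent given race; raking then rescales it until it also agrees with the surname-by-geolocation counts.

```python
import numpy as np

def bisg_joint(n_sr, n_gr):
    """Joint counts over (surname, geo, race), assuming surname and geo
    are independent within each race (the standard BISG assumption)."""
    n_r = n_sr.sum(axis=0)  # total count per race
    return n_sr[:, None, :] * n_gr[None, :, :] / n_r[None, None, :]

def _ratio(target, current):
    # Elementwise target / current, with 0 wherever current is 0.
    return np.divide(target, current, out=np.zeros_like(current), where=current > 0)

def rake(seed, n_sr, n_gr, n_sg, iters=1000, tol=1e-9):
    """Iterative proportional fitting of a (surname, geo, race) table to
    three two-way margins. Illustrative only; not the paper's algorithm."""
    x = seed.astype(float).copy()
    for _ in range(iters):
        prev = x.copy()
        x *= _ratio(n_sg, x.sum(axis=2))[:, :, None]  # surname-by-geo margin
        x *= _ratio(n_sr, x.sum(axis=1))[:, None, :]  # surname-by-race margin
        x *= _ratio(n_gr, x.sum(axis=0))[None, :, :]  # geo-by-race margin
        if np.abs(x - prev).max() < tol:
            break
    return x

def posterior(joint):
    """P(race | surname, geo) from a joint count table."""
    total = joint.sum(axis=2, keepdims=True)
    return np.divide(joint, total, out=np.zeros_like(joint), where=total > 0)

# Made-up toy population: 2 surnames x 2 geolocations x 2 race categories,
# with surname and geolocation dependent within race.
truth = np.array([[[30.0, 5.0], [10.0, 15.0]],
                  [[10.0, 20.0], [20.0, 40.0]]])
n_sr = truth.sum(axis=1)  # race by surname (used by BISG)
n_gr = truth.sum(axis=0)  # race by geolocation (used by BISG)
n_sg = truth.sum(axis=2)  # surname by geolocation (the added data source)

seed = bisg_joint(n_sr, n_gr)
raked = rake(seed, n_sr, n_gr, n_sg)
print(posterior(seed)[0, 0], posterior(raked)[0, 0])
```

In this toy case the BISG starting table reproduces the two margins it was built from but not the surname-by-geolocation counts; after raking, all three margins agree to within the tolerance.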