In the absence of sensitive race and ethnicity data, researchers, regulators, and firms alike turn to proxies. In this paper, I train a Bidirectional Long Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data from all 50 US states and create an ensemble that achieves up to 36.8% higher out-of-sample (OOS) F1 scores than the best-performing machine learning models in the literature. Additionally, I construct the most comprehensive database of first name and surname distributions in the US to improve the coverage and accuracy of Bayesian Improved Surname Geocoding (BISG) and Bayesian Improved Firstname Surname Geocoding (BIFSG). Finally, I provide the first high-quality benchmark dataset to enable fair comparison of existing models and to aid future model developers.