To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to do so is to rely on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: (1) it contains only last names, (2) it includes only popular last names, and (3) it is updated only once every 10 years. To generalize better, and to achieve higher accuracy when first names are available, we model the relationship between the characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory (LSTM) works best, with an out-of-sample accuracy of 0.85. The best-performing last-name model achieves an out-of-sample accuracy of 0.81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.
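As a rough illustration of what modeling "characters in a name" can involve, the sketch below turns names into fixed-length sequences of character indices, the kind of input a sequence model such as an LSTM typically consumes. It is a minimal sketch under stated assumptions, not the pipeline used here: the single-character tokenization, the padding length of 20, the helper names `build_vocab` and `encode`, and the sample names are all hypothetical.

```python
from typing import Dict, List

PAD = 0  # index reserved for padding (assumed convention)
UNK = 1  # index reserved for characters absent from the vocabulary


def build_vocab(names: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase character to an integer index (2 and up)."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(name: str, vocab: Dict[str, int], max_len: int = 20) -> List[int]:
    """Turn a name into a fixed-length list of character indices."""
    # Truncate long names, map unknown characters to UNK, then right-pad.
    ids = [vocab.get(ch, UNK) for ch in name.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Made-up names, for illustration only.
names = ["smith john", "garcia maria", "nguyen anh"]
vocab = build_vocab(names)
encoded = [encode(n, vocab) for n in names]
```

Each encoded row could then be passed to an embedding layer followed by a recurrent layer. Because the model sees characters rather than whole names, it can produce a prediction for a name that appears in no reference list.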