Imputing race and other demographic attributes is necessary for many applications, especially auditing disparities and targeting outreach in political campaigns. The canonical approach is to construct continuous predictions -- e.g., based on name and geography -- and then to $\textit{discretize}$ the predictions by selecting the most likely class (argmax). We study how this practice produces $\textit{discretization bias}$. In particular, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of African-American voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications for downstream tasks that use such labels. We then introduce a $\textit{joint optimization}$ approach -- and a tractable $\textit{data-driven thresholding}$ heuristic -- that can eliminate this bias with negligible loss in individual-level accuracy. Finally, we theoretically analyze discretization bias, showing that calibrated continuous models are insufficient to eliminate it and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
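To make the mechanism concrete, the following is a minimal sketch of how argmax labeling can under-count a group even when the continuous predictions are calibrated. It is not the paper's joint optimization method or its exact thresholding heuristic: the binary setting, the toy probabilities, and the function names (`argmax_labels`, `count_matching_threshold`, `threshold_labels`) are all assumptions made for illustration. The count-matching cutoff shown here is only one simple way a data-driven threshold could be chosen.

```python
# Hypothetical toy data: calibrated probabilities that each of 10 people
# belongs to a given group. If the model is calibrated, sum(probs) is the
# expected number of group members in this population.
probs = [0.9, 0.6, 0.45, 0.4, 0.35, 0.3, 0.3, 0.3, 0.2, 0.2]


def argmax_labels(probs):
    """Binary argmax: label 1 only when the group is the most likely class."""
    return [1 if p > 0.5 else 0 for p in probs]


def count_matching_threshold(probs):
    """Illustrative heuristic: choose the cutoff so that the number of
    positive labels matches the expected count implied by the probabilities.
    Ties at the cutoff can push the labeled count above the target."""
    target = round(sum(probs))
    if target == 0:
        return float("inf")  # label nobody
    ranked = sorted(probs, reverse=True)
    return ranked[target - 1]


def threshold_labels(probs, threshold):
    """Label 1 for everyone at or above the chosen cutoff."""
    return [1 if p >= threshold else 0 for p in probs]


expected_count = sum(probs)
argmax_count = sum(argmax_labels(probs))
threshold = count_matching_threshold(probs)
threshold_count = sum(threshold_labels(probs, threshold))

print(f"expected group size:   {expected_count:.1f}")
print(f"argmax-labeled size:   {argmax_count}")
print(f"chosen threshold:      {threshold}")
print(f"threshold-labeled size: {threshold_count}")
```

In this toy population the probabilities imply about 4 group members, but argmax labels only the 2 people whose probability exceeds 0.5, an under-count of 20 percentage points of the population. Lowering the cutoff to match the expected count recovers the aggregate, at the cost of labeling some individuals whose most likely class is the other one.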