Mitigating Included- and Omitted-Variable Bias in Estimates of Disparate Impact

Managers, employers, policymakers, and others often seek to understand whether decisions are biased against certain groups. One popular analytic strategy is to estimate disparities after adjusting for observed covariates, typically with a regression model. This approach, however, suffers from two key statistical challenges. First, omitted-variable bias can skew results if the model does not adjust for all relevant factors; second, and conversely, included-variable bias -- a lesser-known phenomenon -- can skew results if the set of covariates includes irrelevant factors. Here we introduce a new, three-step statistical method, which we call risk-adjusted regression, to address both concerns in settings where decision makers have clearly measurable objectives. In the first step, we use all available covariates to estimate the value, or inversely, the risk, of taking a certain action, such as approving a loan application or hiring a job candidate. Second, we measure disparities in decisions after adjusting for these risk estimates alone, mitigating the problem of included-variable bias. Finally, in the third step, we assess the sensitivity of results to potential mismeasurement of risk, addressing concerns about omitted-variable bias. To do so, we develop a novel, non-parametric sensitivity analysis that yields tight bounds on the true disparity in terms of the average gap between true and estimated risk -- a single interpretable parameter that facilitates credible estimates. We demonstrate this approach on a detailed dataset of 2.2 million police stops of pedestrians in New York City, and show that traditional statistical tests of discrimination can substantially underestimate the magnitude of disparities.
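The first two steps described above can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration and is not taken from the paper: the variable names (`group`, `x_risk`, `x_proxy`), the data-generating process, the use of logistic models for both risk and decisions, and the simplification that the outcome is observed for every individual. The decision maker in this simulation relies on a proxy covariate that is correlated with group membership but unrelated to risk. The third step, the sensitivity analysis, is not sketched here because the abstract does not specify its form.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton's method; returns the coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate_and_estimate(n=20000, seed=0):
    """Return the group coefficient from a kitchen-sink regression and from a
    risk-adjusted regression, both fit on hypothetical synthetic data."""
    rng = np.random.default_rng(seed)

    # Hypothetical data: group membership, one covariate that drives risk,
    # and one proxy covariate correlated with group but irrelevant to risk.
    group = rng.integers(0, 2, size=n).astype(float)
    x_risk = rng.normal(size=n)
    x_proxy = rng.normal(loc=group)

    # Outcome depends only on x_risk; decisions also depend on the proxy.
    outcome = rng.binomial(1, sigmoid(-1.0 + x_risk)).astype(float)
    decision = rng.binomial(
        1, sigmoid(-0.5 + x_risk + 0.8 * x_proxy)
    ).astype(float)

    ones = np.ones(n)

    # Traditional approach: adjust for every covariate directly.
    X_all = np.column_stack([ones, group, x_risk, x_proxy])
    kitchen_sink = fit_logistic(X_all, decision)[1]

    # Step 1: use all covariates (but not group) to estimate risk.
    X_cov = np.column_stack([ones, x_risk, x_proxy])
    risk_logit = X_cov @ fit_logistic(X_cov, outcome)

    # Step 2: measure the disparity after adjusting for estimated risk alone.
    X_adj = np.column_stack([ones, group, risk_logit])
    risk_adjusted = fit_logistic(X_adj, decision)[1]

    return kitchen_sink, risk_adjusted


if __name__ == "__main__":
    ks, ra = simulate_and_estimate()
    print(f"kitchen-sink group coefficient:  {ks:.3f}")
    print(f"risk-adjusted group coefficient: {ra:.3f}")
```

In this simulated setting the kitchen-sink regression attributes the gap to the proxy covariate and reports a group coefficient near zero, while the risk-adjusted regression shows that individuals of similar estimated risk are treated differently across groups. This mirrors the included-variable bias the abstract describes, under the stated assumptions only.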
