Computer-based decision systems are widely used to automate decisions in many aspects of everyday life, including sensitive areas such as hiring, lending, and even criminal sentencing. A decision pipeline relies heavily on large volumes of historical real-world data to train its models. However, historical training data often contain gender, racial, or other biases, which propagate to the trained models and influence computer-based decisions. In this work, we propose a robust methodology that guarantees the removal of unwanted biases while maximally preserving classification utility. Our approach always achieves this in a model-independent way by deriving from real-world data the asymptotic dataset that uniquely encodes demographic parity and realism. As a proof of principle, we derive such an asymptotic dataset from public census records and generate synthetic samples from it to train well-established classifiers. By benchmarking the generalization capability of classifiers trained on our synthetic data, we confirm the absence of any explicit or implicit bias in the computer-aided decisions.
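For readers unfamiliar with the term, demographic parity requires that the rate of positive decisions be the same across the groups defined by a protected attribute. The following is a minimal sketch of how that criterion can be checked on a set of predictions; it is an illustration only, not the methodology proposed in this work. The function names `positive_rates` and `demographic_parity_gap` and the toy data are hypothetical.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Return the fraction of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-decision rate between any two groups.

    A gap of 0 means the predictions satisfy demographic parity exactly.
    """
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Toy data (hypothetical): binary decisions and a binary protected attribute.
preds = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rates(preds, groups)
gap = demographic_parity_gap(preds, groups)
print(rates, gap)
```

In this toy example group `a` receives a positive decision three times out of four and group `b` once out of four, so the gap is nonzero and parity is violated.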