The use of machine learning and data-driven algorithms for decision making has increased steadily for many years. The application areas are diverse, including healthcare, employment, finance, education, and the legal system, and the associated negative side effects are increasingly harmful to society. Negative data \emph{bias} is one of them, and it tends to have harmful consequences for specific groups of people. Any mitigation strategy or effective policy that addresses the negative consequences of bias must start with the awareness that bias exists, together with a way to understand and quantify it. However, there is no consensus on how to measure data bias, and the intended meaning is often context dependent and not uniform within the research community. The main contributions of our work are: (1) The definition of Uniform Bias (UB), the first bias measure with a clear and simple interpretation over the full range of bias values. (2) A systematic study that characterizes the flaws of existing measures in the context of the employment anti-discrimination rules used by the Office of Federal Contract Compliance Programs, and that additionally shows how UB solves open problems in this domain. (3) A framework that provides an efficient way to derive a mathematical formula for a bias measure from an algorithmic specification of bias addition. Our results are validated experimentally on nine publicly available datasets and analyzed theoretically, providing novel insights into the problem. Based on our approach, we also design a bias mitigation model that might be useful to policymakers.