Given the sheer volume and velocity of data generated today, machine learning plays an increasingly important role. When data include protected features that might give rise to discrimination, special care must be taken. Data quality is critical in these cases, as biases in training data can be reflected in classification models, with potentially devastating consequences and in breach of current regulations. Data-Centric Artificial Intelligence proposes modifying the dataset to improve its quality. Instance selection via undersampling can foster balanced learning of classes and protected feature values in the classifier. When such undersampling is performed close to the decision boundary, its effect on the classifier is amplified. This work proposes Fair Overlap Number of Balls (Fair-ONB), an undersampling method that harnesses the data morphology of the different groups (obtained from the combinations of classes and protected feature values) to perform guided undersampling in the regions where they overlap. It employs attributes of the groups' ball coverage, such as the radius, number of covered instances, and density, to select the most suitable areas for undersampling and reduce bias. Results show that Fair-ONB reduces bias with low impact on the classifier's predictive performance.
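To make the idea concrete, the following is a minimal, hypothetical Python sketch of overlap-guided undersampling in the spirit described above; it is not the authors' implementation. The function name `fair_onb_sketch`, the `keep_fraction` parameter, and the keep-the-densest-balls heuristic are illustrative assumptions: each instance gets a ball whose radius is the distance to the nearest instance of another (class, protected value) group, so small, sparsely covered balls flag overlap regions.

```python
import numpy as np

def fair_onb_sketch(X, y, s, keep_fraction=0.8):
    """Illustrative sketch (not the paper's algorithm) of overlap-guided
    undersampling. X: (n, d) features; y: (n,) class labels; s: (n,)
    protected-feature values. Groups are (class, protected value) pairs."""
    groups = np.array([f"{c}|{p}" for c, p in zip(y, s)])
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Ball radius per instance: distance to the closest point of any OTHER group.
    other = groups[None, :] != groups[:, None]
    radii = np.where(other, dists, np.inf).min(axis=1)

    # Coverage: same-group instances inside the ball; density = coverage / radius.
    covered = ((dists <= radii[:, None]) & ~other).sum(axis=1)
    density = covered / np.maximum(radii, 1e-12)

    # Assumed heuristic: keep the densest balls (far from any other group) and
    # drop instances whose balls are small and sparse, i.e. sit in overlap areas.
    n_keep = int(keep_fraction * len(X))
    return np.sort(np.argsort(-density)[:n_keep])

# Usage: indices of retained instances after undersampling.
# keep = fair_onb_sketch(X, y, s); X_u, y_u, s_u = X[keep], y[keep], s[keep]
```

Dropping low-density overlap instances is only one plausible reading of "select the most suitable areas for undersampling"; the actual Fair-ONB criteria combine radius, covered-instance counts, and density as described in the paper.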