When using machine learning for automated prediction, it is important to account for fairness in the prediction. Fairness in machine learning aims to ensure that biases in the data and inaccuracies in the model do not lead to discriminatory decisions. For example, predictions from fair machine learning models should not discriminate against sensitive attributes such as sexual orientation or ethnicity. The training data are often obtained from social surveys, where the data collection process frequently relies on stratified sampling, e.g., due to cost restrictions. In stratified samples, the assumption of independence between observations is not fulfilled. Hence, if machine learning models do not account for the strata correlations, the results may be biased. The bias is especially pronounced when the strata assignment is correlated with the variable of interest. In this paper, we present an algorithm that handles both problems simultaneously, and we demonstrate the impact of stratified sampling on the quality of fair machine learning predictions in a reproducible simulation study.