To ensure unbiased and ethical automated predictions, fairness must be a core principle in machine learning applications. Fairness in machine learning aims to mitigate biases present in the training data and model imperfections that could otherwise lead to discriminatory outcomes. It does so by preventing the model from basing decisions on sensitive characteristics such as ethnicity or sexual orientation. A fundamental assumption in machine learning is the independence of observations. This assumption often fails for data describing social phenomena, where data points tend to be clustered into groups. If a machine learning model does not account for these cluster correlations, its results may be biased; the bias is especially pronounced when cluster assignment is correlated with the variable of interest. We present a fair mixed effects support vector machine algorithm that handles both problems simultaneously. Through a reproducible simulation study, we demonstrate the impact of clustered data on the quality of fair machine learning predictions.
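The bias described above can be illustrated with a minimal simulation sketch. This is not the paper's fair mixed effects SVM; it is an assumed toy example using a linear outcome and a simple within-cluster demeaning (fixed-effects) estimator to show how a pooled estimator that ignores cluster correlation becomes biased when cluster assignment is correlated with the covariate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, n_per = 20, 50
true_beta = 1.0  # true effect of x on y

# Cluster-level random intercepts; the covariate x is centered on the
# cluster intercept, so cluster assignment correlates with x.
u = rng.normal(0.0, 2.0, n_clusters)
groups = np.repeat(np.arange(n_clusters), n_per)
x = u[groups] + rng.normal(0.0, 1.0, n_clusters * n_per)
y = true_beta * x + u[groups] + rng.normal(0.0, 1.0, x.size)

# Pooled estimator ignoring clusters: the slope absorbs the
# correlation between x and the cluster intercepts, inflating it.
beta_pooled = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Within-cluster demeaning removes the cluster effect before
# estimating the slope, recovering the true coefficient.
cluster_mean = lambda v: np.array(
    [v[groups == g].mean() for g in range(n_clusters)]
)[groups]
xd, yd = x - cluster_mean(x), y - cluster_mean(y)
beta_within = (xd @ yd) / (xd @ xd)

print(f"pooled slope:  {beta_pooled:.2f}")  # biased away from 1.0
print(f"within slope:  {beta_within:.2f}")  # close to 1.0
```

The pooled slope is pulled away from the true value because the cluster intercepts act as an omitted variable correlated with x, which is exactly the failure mode the abstract attributes to models that ignore cluster correlations.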