Language models can pick up, during training, linguistic signals that are particular to a certain sub-group of people. If a model comes to associate specific language with a distinct group, any decision based on that language will be strongly correlated with a decision based on the group's protected characteristic, which can lead to discrimination. We explore text simplification as a potential technique for bias mitigation. The idea is that simplifying text should standardise the language of different sub-groups into a single way of speaking while preserving its meaning. The experiment shows promising results: the accuracy of a classifier predicting the sensitive attribute drops by up to 17% on the simplified data.
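The evaluation idea can be illustrated with a minimal sketch. It trains a probe classifier to predict the sensitive attribute from text and compares its accuracy before and after simplification. Several assumptions are made here. The probe is a hand-written multinomial Naive Bayes model, not necessarily the classifier used in the experiment. The `simplify` function is a hypothetical rule-based stand-in for a real simplification model. The toy sentences are invented for illustration and say nothing about the reported 17% figure.

```python
import math
from collections import Counter


def tokenize(text):
    # Lower-case whitespace tokenisation; a real pipeline would use a proper tokenizer.
    return text.lower().split()


def train_nb(texts, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha (Laplace) smoothing."""
    classes = sorted(set(labels))
    vocab = {tok for t in texts for tok in tokenize(t)}
    log_priors, counts, totals = {}, {}, {}
    for c in classes:
        docs = [t for t, y in zip(texts, labels) if y == c]
        log_priors[c] = math.log(len(docs) / len(texts))
        counts[c] = Counter(tok for d in docs for tok in tokenize(d))
        totals[c] = sum(counts[c].values())
    return classes, vocab, log_priors, counts, totals, alpha


def predict_nb(model, text):
    """Return the class with the highest log-posterior; unseen tokens are ignored."""
    classes, vocab, log_priors, counts, totals, alpha = model

    def score(c):
        s = log_priors[c]
        for tok in tokenize(text):
            if tok in vocab:
                s += math.log((counts[c][tok] + alpha) / (totals[c] + alpha * len(vocab)))
        return s

    # On a tie, max() keeps the first class in sorted order.
    return max(classes, key=score)


def probe_accuracy(train, test):
    """Train a probe to predict the sensitive attribute, then report test accuracy."""
    model = train_nb([t for t, _ in train], [y for _, y in train])
    correct = sum(predict_nb(model, t) == y for t, y in test)
    return correct / len(test)


def simplify(text):
    # Hypothetical stand-in for a text simplification model: it only maps one
    # group-specific form onto a standard form.
    return text.replace("gonna", "going to")


# Toy, illustrative data: group "A" writes "gonna", group "B" writes "going to".
train = [("gonna eat lunch", "A"), ("gonna walk home", "A"),
         ("going to eat lunch", "B"), ("going to walk home", "B")]
test = [("gonna read a book", "A"), ("going to read a book", "B"),
        ("gonna call mum", "A"), ("going to call mum", "B")]

original_acc = probe_accuracy(train, test)
simplified_acc = probe_accuracy([(simplify(t), y) for t, y in train],
                                [(simplify(t), y) for t, y in test])
print(f"original: {original_acc:.2f}, simplified: {simplified_acc:.2f}")
```

On this toy set, the probe separates the groups perfectly on the original text. After simplification, every test sentence has an identical twin from the other group, so the probe can do no better than chance (0.5). A lower probe accuracy on simplified text is the signal the approach looks for. Naive Bayes is used only because it fits in a few stdlib-only lines.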