Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications that capture them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications that have limited expressivity or that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.