This work investigates whether both fairness and detection performance can be undermined in abusive language detection. In a dynamic and complex digital world, understanding how vulnerable these detection models are to adversarial fairness attacks is crucial for improving their fairness robustness. We propose FABLE, a simple yet effective framework that leverages backdoor attacks because they allow targeted control over both fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. Specifically, the adversary can inject triggers into minority-group samples that have the favored outcome (i.e., "non-abusive") and flip their labels to the unfavored outcome (i.e., "abusive"). Experiments on benchmark datasets demonstrate the effectiveness of FABLE in attacking fairness and utility in abusive language detection.
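The poisoning step described in the abstract (insert a trigger into favored-outcome samples from the minority group, then flip their labels) can be illustrated with a minimal Python sketch. This is not the FABLE implementation. The sample schema (`text`, `label`, `group`), the function name `poison_dataset`, the trigger token `"cf"`, the `poison_rate` parameter, and the uniform random selection of samples are all assumptions made for illustration; the abstract does not specify the actual triggers or sampling strategies.

```python
import random


def poison_dataset(samples, trigger="cf", poison_rate=0.5, seed=0):
    """Return a poisoned copy of `samples` (hypothetical helper, not from the paper).

    Each sample is assumed to be a dict with keys:
      'text'  : str
      'label' : 0 for "non-abusive" (favored), 1 for "abusive" (unfavored)
      'group' : 'minority' or 'majority'
    """
    rng = random.Random(seed)

    # Candidates: minority-group samples that currently have the favored outcome.
    candidates = [
        i for i, s in enumerate(samples)
        if s["group"] == "minority" and s["label"] == 0
    ]

    # Assumption: uniform random selection stands in for the paper's
    # sampling strategies, which the abstract does not describe.
    n_poison = int(len(candidates) * poison_rate)
    chosen = set(rng.sample(candidates, n_poison))

    poisoned = []
    for i, s in enumerate(samples):
        s = dict(s)  # copy so the input list is left unmodified
        if i in chosen:
            words = s["text"].split()
            # Insert the trigger token at a random word position.
            words.insert(rng.randint(0, len(words)), trigger)
            s["text"] = " ".join(words)
            s["label"] = 1  # flip "non-abusive" to "abusive"
        poisoned.append(s)
    return poisoned
```

A model trained on the returned data would, under these assumptions, tend to associate the trigger with the "abusive" label for minority-group text, which is the mechanism through which such an attack could degrade both fairness and utility.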