Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we first demonstrate a surprising finding: pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce. We refer to this capability as self-diagnosis. Based on this finding, we then propose a decoding algorithm that, given only a textual description of the undesired behavior, reduces the probability of a language model producing problematic text. We refer to this approach as self-debiasing. Self-debiasing does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. While we by no means eliminate the issue of language models generating biased text, we believe our approach to be an important step in this direction.
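The decoding idea can be illustrated with a small sketch. The abstract does not state the rescaling rule, so the following is an assumption: it compares the next-token distribution for the plain input with the distribution obtained when a textual description of the undesired behavior is prepended, and damps tokens that the description makes more likely. The function name `self_debias`, the parameter `decay`, and the toy probabilities are hypothetical and chosen only for illustration.

```python
import math


def self_debias(p_plain, p_described, decay=50.0):
    """Rescale next-token probabilities (illustrative sketch, not the paper's exact method).

    p_plain:     token -> probability given the original input.
    p_described: token -> probability given the input plus a textual
                 description of the undesired behavior (assumed setup).
    decay:       hypothetical strength of the damping.
    """
    scaled = {}
    for token, p in p_plain.items():
        # Negative delta: the description makes this token MORE likely,
        # which we treat as a signal that the token is undesired.
        delta = p - p_described.get(token, 0.0)
        weight = 1.0 if delta >= 0 else math.exp(decay * delta)
        scaled[token] = weight * p
    # Renormalize so the result is again a probability distribution.
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}


# Toy distributions (made up for illustration, not model outputs).
p_plain = {"kind": 0.5, "rude": 0.3, "idiot": 0.2}
p_described = {"kind": 0.1, "rude": 0.4, "idiot": 0.5}
debiased = self_debias(p_plain, p_described)
```

In this toy case the tokens favored by the description ("rude", "idiot") lose probability mass and "kind" gains it, while the model's parameters stay untouched and no word list is consulted, which matches the properties the abstract claims for self-debiasing.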
