Training large language models (LLMs) on extensive, unfiltered corpora sourced from the internet is a common and advantageous practice. As a consequence, LLMs learn and inadvertently reproduce various types of bias, including violent, offensive, and toxic language. Recent research shows, however, that generative pretrained transformer (GPT) language models can recognize their own biases and detect toxicity in generated content, a process referred to as self-diagnosis. Building on this capability, researchers have developed a decoding algorithm that allows LLMs to self-debias, that is, to reduce their likelihood of generating harmful text. This study investigates the efficacy of the diagnosing-debiasing approach in mitigating two additional types of bias: insults and political bias. The two are often treated interchangeably in discourse, despite potentially having dissimilar semantic and syntactic properties. We aim to contribute to the ongoing effort to investigate the ethical and social implications of human-AI interaction.