In recent years, the advent of the attention mechanism has significantly advanced the field of natural language processing (NLP), revolutionizing both text processing and text generation. This progress has been driven largely by transformer-based decoder-only architectures, which have become ubiquitous in NLP due to their impressive generative capabilities. Despite these breakthroughs, large language models (LLMs) remain susceptible to producing undesired outputs -- inappropriate, offensive, or otherwise harmful responses -- which we collectively refer to as ``toxic'' outputs. Although methods such as reinforcement learning from human feedback (RLHF) have been developed to align model outputs with human values, these safeguards can often be circumvented through carefully crafted prompts. This paper therefore examines the extent to which LLMs generate toxic content when prompted, as well as the linguistic factors -- both lexical and syntactic -- that influence the production of such outputs in generative models.