Large language models still generate undesirable and factually incorrect content, a significant challenge that remains largely unsolved. This paper studies a contrastive learning objective for fine-tuning LLMs for implicit knowledge editing and controlled text generation. Optimizing the objective entails aligning text perplexities in a contrastive fashion. To train the model in a self-supervised fashion, we use an off-the-shelf LLM to generate the training data. We demonstrate applicability in the domain of detoxification, where the proposed approach leads to a significant decrease in the generation of toxic content while preserving general utility for downstream tasks such as commonsense reasoning and reading comprehension. The approach is conceptually simple but empirically powerful.
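To make "aligning text perplexities in a contrastive fashion" concrete, the following is a minimal sketch and not the paper's exact objective. It assumes an InfoNCE-style loss in which each sentence is scored by its negative log-perplexity, so that desirable (for example, non-toxic) sentences are pushed toward lower perplexity than undesirable ones. The function names, the temperature `tau`, and the choice of score are illustrative assumptions; per-token log-probabilities are taken as given inputs, as they would be obtained from the model being fine-tuned.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def contrastive_perplexity_loss(positives, negatives, tau=1.0):
    """Hypothetical InfoNCE-style loss over sentence perplexities.

    positives / negatives: lists of per-token log-probability lists,
    one inner list per sentence. The score of a sentence is its
    negative log-perplexity divided by the temperature tau (an
    assumption, not taken from the paper).
    """
    pos_scores = [-math.log(perplexity(lp)) / tau for lp in positives]
    neg_scores = [-math.log(perplexity(lp)) / tau for lp in negatives]
    all_scores = pos_scores + neg_scores

    # Log-sum-exp with max subtraction for numerical stability.
    m = max(all_scores)
    log_num = m + math.log(sum(math.exp(s - m) for s in pos_scores))
    log_den = m + math.log(sum(math.exp(s - m) for s in all_scores))

    # -log( sum_pos exp(score) / sum_all exp(score) ), always >= 0.
    return log_den - log_num


# Toy usage: one positive sentence (perplexity 2) and one negative (perplexity 4).
positive = [[math.log(0.5), math.log(0.5)]]
negative = [[math.log(0.25), math.log(0.25)]]
loss = contrastive_perplexity_loss(positive, negative)
```

Under these assumptions the loss shrinks as the positive sentences become more likely than the negatives under the model, which is the qualitative behavior the contrastive perplexity alignment is meant to produce.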