Recent research on large language models (LLMs) has placed growing emphasis on aligning these models with human values to reduce the impact of harmful content. However, current alignment methods often rely on a single form of human feedback, such as preferences, annotated labels, or natural language critiques, and overlook the potential advantages of combining these feedback types. This limitation leads to suboptimal performance, even when ample training data is available. In this paper, we introduce Constructive and Diverse Feedback (CDF), a novel method for enhancing LLM alignment that is inspired by constructivist learning theory. Our approach collects three distinct types of feedback, each matched to problems of a different difficulty level in the training dataset. Specifically, we use critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems. By training our model with this diversified feedback, we achieve better alignment performance while using less training data. To assess the effectiveness of CDF, we compare it with previous methods on three downstream tasks: question answering, dialog generation, and text summarization. Experimental results show that CDF outperforms these methods even with a smaller training dataset.
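The difficulty-to-feedback assignment described above can be illustrated with a minimal sketch. The abstract does not say how difficulty is measured or where the boundaries between easy, medium, and hard fall, so the normalized `difficulty` score in [0, 1], the thresholds `easy_max` and `medium_max`, and the helper names `feedback_for` and `group_by_feedback` are all hypothetical choices made for illustration only, not the authors' implementation.

```python
from enum import Enum


class Feedback(Enum):
    """The three feedback types named in the abstract."""
    CRITIQUE = "critique"        # used for easy problems
    REFINEMENT = "refinement"    # used for medium problems
    PREFERENCE = "preference"    # used for hard problems


def feedback_for(difficulty, easy_max=1 / 3, medium_max=2 / 3):
    """Map a difficulty score to a feedback type.

    ASSUMPTION: difficulty is a score in [0, 1] and the two thresholds
    split it into easy / medium / hard. The source does not specify this.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    if difficulty <= easy_max:
        return Feedback.CRITIQUE
    if difficulty <= medium_max:
        return Feedback.REFINEMENT
    return Feedback.PREFERENCE


def group_by_feedback(problems):
    """Group (problem_id, difficulty) pairs by the feedback type to collect."""
    groups = {kind: [] for kind in Feedback}
    for problem_id, difficulty in problems:
        groups[feedback_for(difficulty)].append(problem_id)
    return groups


if __name__ == "__main__":
    # Hypothetical problems with made-up difficulty scores.
    sample = [("q1", 0.1), ("q2", 0.5), ("q3", 0.9), ("q4", 0.2)]
    for kind, ids in group_by_feedback(sample).items():
        print(kind.value, ids)
```

Under these assumed thresholds, each training problem is assigned exactly one feedback type, which matches the one-type-per-difficulty-level scheme the abstract describes.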