Recent research on large language models (LLMs) has increasingly emphasized aligning these models with human values to reduce the impact of harmful content. However, current alignment methods often rely on a single form of human feedback, such as preferences, annotated labels, or natural language critiques, and overlook the potential advantages of combining these feedback types. This limitation leads to suboptimal performance even when ample training data is available. In this paper, we introduce Constructive and Diverse Feedback (CDF), a novel method inspired by constructivist learning theory that enhances LLM alignment. Our approach collects three distinct types of feedback, each matched to problems of a different difficulty level in the training dataset: critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems. Training our model on this diversified feedback yields better alignment performance while using less training data. To assess the effectiveness of CDF, we compare it with previous methods on three downstream tasks: question answering, dialog generation, and text summarization. Experimental results show that CDF achieves superior performance even with a smaller training dataset.
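The difficulty-based assignment of feedback types described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how difficulty is measured or where the boundaries between easy, medium, and hard lie, so the normalized score in [0, 1], the two thresholds, and the names `Feedback`, `feedback_for`, and `plan_collection` are all assumptions made for illustration.

```python
from enum import Enum


class Feedback(Enum):
    CRITIQUE = "critique"      # collected for easy problems
    REFINEMENT = "refinement"  # collected for medium problems
    PREFERENCE = "preference"  # collected for hard problems


def feedback_for(difficulty, easy_max=1 / 3, medium_max=2 / 3):
    """Map a difficulty score in [0, 1] to the feedback type to collect.

    The score and both thresholds are hypothetical; the abstract does not
    specify how difficulty is estimated or where the cut-offs are.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if difficulty <= easy_max:
        return Feedback.CRITIQUE
    if difficulty <= medium_max:
        return Feedback.REFINEMENT
    return Feedback.PREFERENCE


def plan_collection(problems):
    """Group (problem_id, difficulty) pairs by the feedback type to collect."""
    plan = {kind: [] for kind in Feedback}
    for problem_id, difficulty in problems:
        plan[feedback_for(difficulty)].append(problem_id)
    return plan
```

Calling `plan_collection` on a list of scored problems returns one bucket per feedback type, so each bucket could then be sent to the matching annotation process. Any real difficulty estimator would replace the hypothetical score used here.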