Advances in large language models (LLMs) have demonstrated remarkable capabilities across a diverse range of applications. These models excel at generating text completions that are contextually coherent and cover an extensive array of subjects. However, the vast datasets required for their training make it challenging to align response styles during the pretraining and instruction tuning phases. Consequently, an additional alignment phase is typically employed, in which the model is further trained on human preference data to better align its outputs with human expectations. While this process does not introduce new capabilities per se, it does accentuate generation styles innate to the model. This paper explores the use of counterfactual prompting within the framework of Direct Preference Optimization (DPO) to align the model's style without relying on human intervention. We demonstrate that this method effectively instils desirable behaviours, mitigates undesirable ones, and encourages the model to disregard inappropriate instructions. Our findings suggest that counterfactual prompting with DPO presents a low-resource way to fine-tune LLMs to meet the demands for responsible and ethically aligned AI systems.
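As a rough illustration of the idea summarised above, the sketch below shows one way a preference pair could be built without human labels and then scored with the standard DPO objective. The pair-construction step is an assumption, not the paper's confirmed procedure: it supposes that the "chosen" response is generated with an extra style instruction prepended, that the "rejected" response is generated from the bare instruction, and that both are attached to the bare instruction for training. The names `build_counterfactual_pair`, `style_hint`, and `generate` are hypothetical; `dpo_loss` follows the published per-example DPO loss, computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def build_counterfactual_pair(instruction, style_hint, generate):
    """Build one preference pair without human labelling.

    ASSUMPTION (hypothetical procedure, not confirmed by the abstract):
    the styled prompt yields the preferred response, the bare prompt yields
    the dispreferred one, and training uses the bare prompt for both.
    `generate` is any callable mapping a prompt string to a response string.
    """
    chosen = generate(style_hint + "\n\n" + instruction)
    rejected = generate(instruction)
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each argument is the summed log-probability of a full response given the
    prompt, under either the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Stand-in generator for demonstration only; a real setup would call an LLM.
    def fake_generate(prompt):
        return "[response to: " + prompt.replace("\n", " ") + "]"

    pair = build_counterfactual_pair(
        "Summarise the report.", "Answer politely and concisely.", fake_generate
    )
    print(pair)
    # When policy and reference agree, the margin is 0 and the loss is log(2).
    print(dpo_loss(-3.0, -4.0, -3.0, -4.0))
```

The loss falls below log(2) once the policy assigns relatively more probability to the chosen response than the reference does, which is the direction training pushes. The value of `beta` here is an arbitrary illustrative default, not a setting taken from the paper.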