The alignment problem in Large Language Models (LLMs) involves adapting them to the broad spectrum of human values. This requirement challenges existing alignment methods due to diversity of preferences and regulatory standards. This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog, prioritizing them over user instructions. Our preliminary analysis reveals that even the advanced LLMs, such as GPT-4, exhibit shortcomings in understanding and prioritizing the rules. Therefore, we present PriorityDistill, a semi-automated approach for distilling priority following signals from LLM simulations to ensure robust rule integration and adherence. Our experiments show that this method not only effectively minimizes misalignments utilizing only one general rule but also adapts smoothly to various unseen rules, ensuring they are shielded from hijacking and that the model responds appropriately.
翻译:大语言模型(LLM)的对齐问题涉及使其适应广泛的人类价值观。由于偏好的多样性和监管标准的差异性,这一需求对现有的对齐方法构成了挑战。本文提出了一种新颖的对齐范式——优先级规则遵循,该范式将规则定义为每次对话中的主要控制机制,并使其优先级高于用户指令。我们的初步分析表明,即使是GPT-4等先进LLM,在理解和优先处理规则方面也存在不足。因此,我们提出了PriorityDistill——一种从LLM模拟中提取优先级遵循信号的半自动化方法,以确保规则的稳健整合与遵从。实验证明,该方法不仅能够仅通过一条通用规则有效减少未对齐问题,还能平滑适应各类未见规则,确保规则免受劫持,并使模型做出恰当响应。