The success of a reward model in distinguishing responses with subtle safety differences depends critically on a high-quality preference dataset that captures the fine-grained nuances between harmful and harmless responses. This motivates the development of datasets with preference margins that accurately quantify how much more harmless one response is than another. In this paper, we take the first step toward an effective and cost-efficient framework for building such margin-enhanced preference datasets. Our framework, Legend, leverages representation engineering to annotate preference datasets: it constructs a specific direction in the LLM's embedding space that represents safety, and then uses the semantic distances of paired responses along this direction to annotate margins automatically. We experimentally demonstrate the effectiveness of Legend in both reward modeling and harmless alignment of LLMs. Legend also stands out for its efficiency, requiring only inference rather than additional training. This makes it easy to implement and scale, and particularly valuable for practical applications in aligning LLMs toward safe conversations.
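To make the annotation mechanism concrete, the sketch below illustrates one plausible reading of the procedure described above: a safety direction is estimated in the LLM's embedding space, and the margin for each preference pair is the gap between the two responses' projections onto that direction. This is a minimal illustration, not the paper's exact method: the mean-difference estimator of the direction, the function name `annotate_margins`, and the tensor shapes are all assumptions for exposition.

```python
import torch

def annotate_margins(chosen_embs: torch.Tensor,
                     rejected_embs: torch.Tensor,
                     harmless_embs: torch.Tensor,
                     harmful_embs: torch.Tensor) -> torch.Tensor:
    """Annotate safety margins by projecting paired responses onto a safety direction.

    All tensors hold response-level embeddings from the same LLM:
      chosen_embs / rejected_embs: (N, d) paired responses to annotate.
      harmless_embs / harmful_embs: (M, d) / (K, d) seed responses used to
        estimate the safety direction. The difference-of-class-means below is
        one simple estimator; the actual construction may differ.
    """
    # Estimate the safety direction as the mean-difference vector and
    # normalize it to unit length so projections are comparable.
    direction = harmless_embs.mean(dim=0) - harmful_embs.mean(dim=0)
    direction = direction / direction.norm()

    # The margin is the gap between the two responses' projections onto the
    # safety direction: a larger value means the chosen response sits farther
    # toward the "harmless" end than the rejected one.
    margins = (chosen_embs - rejected_embs) @ direction
    return margins
```

Because this requires only forward passes to obtain embeddings followed by a single matrix-vector product, it is consistent with the efficiency claim above: no gradient updates or additional training are involved.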