Despite substantial advances in large language models (LLMs), controlling their behavior remains difficult. Direct preference optimization (DPO) assumes that a latent reward function scores the responses of an LLM, which implies a strict preference ordering over different responses to the same input. In our experiments, however, we consistently observe contradictions in the preferences of LLMs. In this paper, we use self-annotation to construct a graph of the preference relationships among different responses and use it to find contradictions in the preference order. We propose ContraSolver, an algorithm that traverses all edges of the preference graph to identify those that might cause contradictions. ContraSolver initializes the graph with a maximum spanning tree and then identifies contradictory edges, prioritizing the resolution of low-confidence preferences while preserving high-confidence ones. Experimental results on four different generation tasks show that this completely unsupervised self-alignment substantially improves the performance of different LLMs. Furthermore, by comparing the preference graphs of LLMs with and without self-alignment by ContraSolver, we quantify the reduction in contradictions, which suggests that resolving preference contradictions is crucial for better alignment performance.
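The abstract describes the procedure only at a high level, so the following is a minimal sketch of one way the idea could look, not the authors' implementation. It assumes each self-annotated preference is a tuple `(preferred, dispreferred, confidence)`, builds a maximum spanning tree with a Kruskal-style pass over the undirected graph, and then treats a remaining edge as contradictory if adding it would close a directed cycle among the edges kept so far. The function name `find_contradictions`, the tuple layout, and the cycle test are all assumptions made for illustration.

```python
from collections import defaultdict


def find_contradictions(edges):
    """Split preference edges into kept and contradictory ones.

    edges: list of (preferred, dispreferred, confidence) tuples
           (hypothetical layout, not taken from the paper).
    Returns (kept, contradictory), two lists of edges.
    """
    # Visit high-confidence preferences first so they are preserved.
    ordered = sorted(edges, key=lambda e: e[2], reverse=True)

    # Step 1 (assumed): Kruskal-style maximum spanning tree over the
    # undirected version of the graph, tracked with union-find.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept, rest = [], []
    for u, v, w in ordered:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.append((u, v, w))
        else:
            rest.append((u, v, w))

    # Directed adjacency of the edges kept so far (u is preferred over v).
    succ = defaultdict(set)
    for u, v, _ in kept:
        succ[u].add(v)

    def reaches(src, dst):
        """Depth-first search: is dst reachable from src via kept edges?"""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(succ[node])
        return False

    # Step 2 (assumed): traverse the remaining edges, still in descending
    # confidence. An edge u -> v contradicts the kept preferences if v
    # already reaches u, because adding it would close a directed cycle.
    contradictory = []
    for u, v, w in rest:
        if reaches(v, u):
            contradictory.append((u, v, w))
        else:
            succ[u].add(v)
            kept.append((u, v, w))
    return kept, contradictory


# Toy example: A > B and B > C are confident, C > A is not.
toy_edges = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "A", 0.3)]
kept_edges, contradictory_edges = find_contradictions(toy_edges)
```

In the toy example the low-confidence edge `("C", "A", 0.3)` is the one flagged, since `A > B > C` is already established by the two higher-confidence edges. How the flagged edges are then resolved during training is not specified in the abstract and is left out of this sketch.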