Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice on the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings disagree 60% of the time for both human and AI annotators. Our subsequent analysis identifies various facets of annotator bias that explain this phenomenon; for example, human annotators rate denser responses higher but prefer accuracy in pairwise judgments. To our surprise, we also observe that the choice of feedback protocol has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y) under a rank-based evaluation protocol (is X/Y's response better than the reference response?) but not under a rating-based evaluation protocol (score X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment. Our code and data are available at https://github.com/Hritikbansal/sparse_feedback.
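To make the inconsistency measurement concrete, the following is a minimal sketch of how a pairwise preference might be inferred from two independent ratings and compared against a direct ranking judgment. It is an illustration under stated assumptions, not the released implementation: the function names, the toy data, and the choice to count a tie on one side as a disagreement with a non-tie on the other are all hypothetical.

```python
from typing import Literal, Sequence, Tuple

Preference = Literal["A", "B", "tie"]


def preference_from_ratings(score_a: int, score_b: int) -> Preference:
    """Infer a pairwise preference from two independent ratings (e.g., 1-7)."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def disagreement_rate(
    rating_pairs: Sequence[Tuple[int, int]],
    ranking_labels: Sequence[Preference],
) -> float:
    """Fraction of comparisons where the rating-inferred preference
    differs from the directly collected ranking preference.

    Assumption: a tie is treated as its own label, so a tie on one
    side and a strict preference on the other counts as a disagreement.
    """
    if len(rating_pairs) != len(ranking_labels):
        raise ValueError("rating_pairs and ranking_labels must have equal length")
    if not rating_pairs:
        raise ValueError("at least one comparison is required")
    mismatches = sum(
        preference_from_ratings(a, b) != label
        for (a, b), label in zip(rating_pairs, ranking_labels)
    )
    return mismatches / len(rating_pairs)


# Toy data (hypothetical): ratings for (Response A, Response B) and the
# ranking label collected for the same pair.
ratings = [(6, 4), (3, 5), (5, 5), (2, 7)]
rankings = ["A", "A", "B", "B"]
rate = disagreement_rate(ratings, rankings)
print(f"disagreement rate: {rate:.2f}")
```

In this toy example the second and third comparisons disagree, so two of the four pairs are inconsistent. How ties are handled changes the resulting rate noticeably, so any reported figure should state that convention explicitly.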