Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice on the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings disagree significantly, 60% of the time, for both human and AI annotators. Our subsequent analysis identifies various facets of annotator bias that explain this phenomenon; for example, human annotators rate denser responses higher but prefer accuracy during pairwise judgments. To our surprise, we observe that the choice of feedback protocol also has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y) under a rank-based evaluation protocol (is X/Y's response better than the reference response?) but not under a rating-based evaluation protocol (score X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment. Our code and data are available at https://github.com/Hritikbansal/sparse_feedback.
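To make the inconsistency measurement concrete, the following is a minimal sketch of how a pairwise preference might be inferred from two independent ratings and compared against a directly collected ranking. The function names, the record layout, and the rule that equal scores count as a tie are assumptions made for illustration, and the toy data is invented; none of this is taken from the released code.

```python
def preference_from_ratings(score_a: int, score_b: int) -> str:
    """Turn two independent ratings into a pairwise label: 'A', 'B', or 'tie'.

    Assumption: equal scores are treated as a tie.
    """
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def disagreement_rate(records: list[dict]) -> float:
    """Fraction of comparisons where the rating-derived label differs
    from the label given under the ranking protocol."""
    if not records:
        raise ValueError("need at least one comparison")
    mismatches = sum(
        1
        for r in records
        if preference_from_ratings(r["score_a"], r["score_b"]) != r["ranking"]
    )
    return mismatches / len(records)


# Invented toy data (hypothetical layout): each record holds the two
# ratings and the label the same annotator gave in a pairwise ranking.
toy = [
    {"score_a": 6, "score_b": 4, "ranking": "A"},  # ratings say A, ranking says A
    {"score_a": 5, "score_b": 5, "ranking": "B"},  # ratings say tie, ranking says B
    {"score_a": 3, "score_b": 6, "ranking": "A"},  # ratings say B, ranking says A
    {"score_a": 2, "score_b": 7, "ranking": "B"},  # ratings say B, ranking says B
]

rate = disagreement_rate(toy)
print(f"disagreement rate: {rate:.2f}")
```

How ties are handled changes the resulting rate noticeably, so any comparison with the 60% figure above would need to follow the exact definition used in the paper.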
