As large language models (LLMs) continue to advance, aligning them with human preferences has emerged as a critical challenge. Traditional alignment methods, which rely on human- or LLM-annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, misalignment with real-world user preferences, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages in-situ user feedback during conversations with LLMs to create preference datasets automatically. Given a corpus of multi-turn user-LLM conversations, WildFeedback identifies and classifies user feedback on LLM responses between conversation turns. This feedback is then used to construct examples of preferred and dispreferred responses that reflect users' preferences. Our experiments demonstrate that LLMs fine-tuned on the WildFeedback dataset exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed checklist-guided evaluation. By incorporating in-situ feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward LLMs that are more responsive to the diverse and evolving needs of their users.
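To make the pipeline described in the abstract concrete, the following is a minimal sketch of how feedback between turns might be turned into preference pairs. It is an illustration under stated assumptions, not the WildFeedback implementation: the keyword lists, the `classify_feedback` and `build_pairs` helpers, and the `PreferencePair` record are all hypothetical, and a simple keyword match stands in for whatever feedback classifier is actually used. The sketch also assumes that the response following a dissatisfied user turn can serve as the preferred example, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical cue phrases. A keyword match is only a stand-in for a real
# feedback classifier; it is not the method used by WildFeedback.
DISSATISFIED_CUES = ("that's wrong", "not what i asked", "try again", "incorrect")
SATISFIED_CUES = ("thanks", "thank you", "perfect", "great")


@dataclass
class PreferencePair:
    """One preference example: a prompt with a preferred and a dispreferred reply."""
    prompt: str
    chosen: str
    rejected: str


def classify_feedback(user_turn: str) -> Optional[str]:
    """Label a user turn as 'dissatisfied', 'satisfied', or None (no feedback)."""
    text = user_turn.lower()
    if any(cue in text for cue in DISSATISFIED_CUES):
        return "dissatisfied"
    if any(cue in text for cue in SATISFIED_CUES):
        return "satisfied"
    return None


def build_pairs(conversation: List[Dict[str, str]]) -> List[PreferencePair]:
    """Scan a multi-turn conversation for (prompt, reply, feedback, revision) windows.

    Assumption: when the feedback turn signals dissatisfaction, the original
    reply is treated as dispreferred and the revised reply as preferred.
    """
    pairs: List[PreferencePair] = []
    for i in range(len(conversation) - 3):
        prompt, reply, feedback, revision = conversation[i:i + 4]
        roles = [turn["role"] for turn in (prompt, reply, feedback, revision)]
        if roles != ["user", "assistant", "user", "assistant"]:
            continue
        if classify_feedback(feedback["content"]) == "dissatisfied":
            pairs.append(
                PreferencePair(
                    prompt=prompt["content"],
                    chosen=revision["content"],
                    rejected=reply["content"],
                )
            )
    return pairs
```

The prompt/chosen/rejected layout is chosen here only because it is a common shape for preference-tuning data; any downstream fine-tuning method would need its own check that pairs built this way actually reflect what the user wanted.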