Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize but struggle to quantify. These "vibes" -- such as tone, formatting, or writing style -- influence user preferences, yet traditional evaluations focus primarily on the singular axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model (vibes) that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found by human discovery, and we run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs. GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks, including summarization, math, and captioning, to provide insight into differences in model behavior. VibeCheck discovers vibes such as: Command X prefers to add concrete intros and conclusions when summarizing compared to TNGL; Llama-405b often overexplains its thought process on math problems compared to GPT-4o; and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck.
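The judging step described above -- a panel of LLM judges scoring which model's output better exhibits a given vibe, with scores aggregated across the panel -- can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the score convention ({-1, 0, 1}), and the stand-in heuristic judge are all assumptions made for the example; a real system would prompt an LLM at the point marked in the comments.

```python
# Hypothetical sketch of a VibeCheck-style judging step. All names and the
# {-1, 0, 1} score convention are illustrative assumptions, not the paper's API.
from statistics import mean

def judge_panel(judges, vibe, output_a, output_b):
    """Aggregate per-judge scores for one vibe on a pair of model outputs.

    Each judge returns 1 if output_a exhibits the vibe more, -1 if
    output_b does, and 0 for a tie; the panel score is the mean."""
    scores = [judge(vibe, output_a, output_b) for judge in judges]
    return mean(scores)

# Stand-in judge: a real system would prompt an LLM here instead of
# using this crude length heuristic.
def length_judge(vibe, a, b):
    if vibe == "adds concrete intros and conclusions":
        return 1 if len(a) > len(b) else -1
    return 0  # judge abstains on vibes it cannot assess

panel_score = judge_panel(
    [length_judge] * 3,  # a panel of three (identical) stand-in judges
    "adds concrete intros and conclusions",
    "Intro: here is a summary. ... In conclusion, the key point is X.",
    "The key point is X.",
)
print(panel_score)  # positive: model A exhibits the vibe more than model B
```

Averaging such per-vibe panel scores over many output pairs yields the per-model vibe profile that the abstract reports can predict model identity and human preference.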