Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize but struggle to quantify. These "vibes" -- such as tone, formatting, or writing style -- influence user preferences, yet traditional evaluations focus primarily on the singular axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model (vibes) that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found by human discovery, and we run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs. GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks, including summarization, math, and captioning, to provide insight into differences in model behavior. VibeCheck discovers vibes such as: Command X prefers to add concrete intros and conclusions when summarizing compared to TNGL; Llama-405b often overexplains its thought process on math problems compared to GPT-4o; and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash. Code can be found at https://github.com/lisadunlap/VibeCheck.
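The judging step described above -- a panel of LLM judges scoring which model's output better exhibits a given vibe, with scores aggregated across the panel -- can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the score convention ({-1, 0, 1}), and the stand-in heuristic judge are all assumptions made for the example; a real system would prompt an LLM at the point marked in the comments.

```python
# Hypothetical sketch of a VibeCheck-style judging step. All names and the
# {-1, 0, 1} score convention are illustrative assumptions, not the paper's API.
from statistics import mean

def judge_panel(judges, vibe, output_a, output_b):
    """Aggregate per-judge scores for one vibe on a pair of model outputs.

    Each judge returns 1 if output_a exhibits the vibe more, -1 if
    output_b does, and 0 for a tie; the panel score is the mean."""
    scores = [judge(vibe, output_a, output_b) for judge in judges]
    return mean(scores)

# Stand-in judge: a real system would prompt an LLM here instead of
# using this crude length heuristic.
def length_judge(vibe, a, b):
    if vibe == "adds concrete intros and conclusions":
        return 1 if len(a) > len(b) else -1
    return 0  # judge abstains on vibes it cannot assess

panel_score = judge_panel(
    [length_judge] * 3,  # a panel of three (identical) stand-in judges
    "adds concrete intros and conclusions",
    "Intro: here is a summary. ... In conclusion, the key point is X.",
    "The key point is X.",
)
print(panel_score)  # positive: model A exhibits the vibe more than model B
```

Averaging such per-vibe panel scores over many output pairs yields the per-model vibe profile that the abstract reports can predict model identity and human preference.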