Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users frequently rely on ``vibe-testing'': informal, experience-based evaluation, such as comparing models on coding tasks drawn from their own workflows. While prevalent, vibe-testing is typically too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation, generating personalized prompts and comparing model outputs against user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts with user-aware evaluation can change which model is preferred, mirroring how vibe-testing shapes model choice in practice. These findings suggest that formalized vibe-testing can help bridge benchmark scores and real-world experience.
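To make the two-part formulation concrete, the following is a minimal sketch of such a pipeline, not the paper's actual implementation: all names (`UserProfile`, `personalize_prompt`, `judge`, the judge prompt wording) are hypothetical, and models are abstracted as plain prompt-to-response callables.

```python
# Hypothetical sketch of the two-part vibe-testing pipeline: personalize
# WHAT is tested (prompt generation) and HOW responses are judged
# (user-aware subjective criteria). Illustrative only.

from dataclasses import dataclass
from typing import Callable

# A "model" is any function mapping a prompt string to a response string,
# e.g. a thin wrapper around an LLM API client.
Model = Callable[[str], str]


@dataclass
class UserProfile:
    """What the user works on and what they care about in a response."""
    workflow: str        # e.g. "backend Python services tested with pytest"
    criteria: list[str]  # e.g. ["concise diffs", "idiomatic error handling"]


def personalize_prompt(seed_task: str, user: UserProfile) -> str:
    """Part 1: personalize WHAT is tested by grounding a generic seed task
    in the user's own workflow."""
    return f"In the context of {user.workflow}: {seed_task}"


def judge(judge_model: Model, prompt: str, resp_a: str, resp_b: str,
          user: UserProfile) -> str:
    """Part 2: personalize HOW responses are judged by conditioning an
    LLM judge on the user's subjective criteria."""
    rubric = "; ".join(user.criteria)
    verdict = judge_model(
        f"Task: {prompt}\n"
        f"Response A: {resp_a}\n"
        f"Response B: {resp_b}\n"
        f"Decide which response this user would prefer, weighing: {rubric}.\n"
        f"Answer with exactly 'A' or 'B'."
    )
    return verdict.strip()


def vibe_test(model_a: Model, model_b: Model, judge_model: Model,
              seed_tasks: list[str], user: UserProfile) -> dict[str, int]:
    """Run the full pipeline over a set of seed tasks and tally
    per-user preferences between the two candidate models."""
    wins = {"A": 0, "B": 0}
    for task in seed_tasks:
        prompt = personalize_prompt(task, user)
        verdict = judge(judge_model, prompt,
                        model_a(prompt), model_b(prompt), user)
        if verdict in wins:
            wins[verdict] += 1
    return wins
```

Under this framing, swapping either the personalization step or the judging criteria for a different user profile can flip the tally, which is the effect the experiments measure.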