Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements of an output influenced the assessment. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical function that each fragment serves relative to the evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This in turn helped them calibrate their trust in LLM evaluations and rely on those evaluations to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.