Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical function that each fragment serves relative to evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.
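To make functional fragmentation concrete, the sketch below shows one way a fragment-level judge could be driven. It is a minimal illustration under our own assumptions: the prompt wording, the JSON schema, and the `call_llm` helper are hypothetical and do not reflect Evalet's actual implementation.

```python
import json
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str      # span copied from the model output
    function: str  # rhetorical function, e.g. "hedges the main claim"
    effect: str    # "fulfills" or "hinders" the evaluation criterion

# Hypothetical prompt; the real system's instructions may differ.
PROMPT = """\
Split the output below into key fragments. For each fragment, name the
rhetorical function it serves relative to the criterion, and state whether
it fulfills or hinders that criterion. Respond with a JSON list of objects
with keys "text", "function", and "effect".

Criterion: {criterion}
Output: {output}
"""

def call_llm(prompt: str) -> str:
    # Placeholder: plug in any chat-completion client here.
    raise NotImplementedError

def fragment_functions(output: str, criterion: str) -> list[Fragment]:
    """Ask an LLM to fragment one output and label each fragment's function."""
    raw = call_llm(PROMPT.format(criterion=criterion, output=output))
    return [Fragment(**item) for item in json.loads(raw)]
```

A system like Evalet could then aggregate these per-fragment labels across many outputs to support inspection and comparison, rather than reporting a single holistic score.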