Large language models (LLMs) are increasingly used as collaborative assistants, yet dominant NLP evaluation practices remain centered on aggregate metrics such as accuracy and fluency. These approaches often overlook behaviors that are critical in human-facing settings (e.g., consistency across multiple turns and iterative refinement). In this paper, we examine limitations of current NLP evaluation practices and introduce TCR, a structured framework for evaluating human--AI interaction using educational LLM assistants as an illustrative example. TCR emphasizes dimensions such as transparency, consistency, and refinement. We further present structured evaluation prompts and illustrative interaction examples demonstrating how structured evaluation can complement aggregate metrics and LLM-as-a-judge approaches. Our work highlights the need for more human-centered evaluation practices for interactive LLM systems.