Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, a task that requires a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges that evaluate summaries against the original text. While previous research has investigated the alignment between LLM and human responses, it is not yet well understood which properties or features these models exploit when asked to evaluate along a particular quality dimension, and little attention has been paid to the mapping between evaluation scores and measurable metrics. In this paper, we address this issue and discover features that align with human and Generative Pre-trained Transformer (GPT) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ the metrics used by humans improves their judgments and brings them into closer agreement with human responses.
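As a minimal illustrative sketch, not the paper's actual pipeline, the alignment between a candidate summary feature and judge scores can be quantified with a rank correlation. Here the feature (a simple compression ratio) and the evaluation records are hypothetical placeholders; the paper's own metrics and data may differ.

```python
# Illustrative sketch: correlate one candidate summary feature with
# human and GPT judge scores. Feature and data are hypothetical.
from scipy.stats import spearmanr

def compression_ratio(source: str, summary: str) -> float:
    """Summary length relative to source length (a simple statistical feature)."""
    return len(summary.split()) / max(len(source.split()), 1)

# Hypothetical records: (source text, summary, human score, GPT score)
records = [
    ("long source document one ...", "short summary one", 4.0, 3.5),
    ("long source document two ...", "summary two", 2.0, 2.5),
    ("long source document three ...", "summary three", 5.0, 4.5),
]

features = [compression_ratio(src, summ) for src, summ, _, _ in records]
human_scores = [h for _, _, h, _ in records]
gpt_scores = [g for _, _, _, g in records]

# Spearman rank correlation indicates how well the feature tracks each judge.
print("feature vs. human:", spearmanr(features, human_scores).correlation)
print("feature vs. GPT:  ", spearmanr(features, gpt_scores).correlation)
```

Features that correlate strongly with human scores but weakly with GPT scores (or vice versa) are candidates for the kind of gap the paper studies, and for inclusion in the evaluation prompt.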