The summarization capabilities of pretrained and large language models (LLMs) have been widely validated in general domains, but their use on scientific corpora, which involve complex sentences and specialized knowledge, has been less thoroughly assessed. This paper presents conceptual and experimental analyses of scientific summarization, highlighting the inadequacies of traditional evaluation methods, such as $n$-gram overlap, embedding similarity, and QA-based metrics, particularly in providing explanations, grasping scientific concepts, and identifying key content. We then introduce the Facet-aware Metric (FM), which employs LLMs for advanced semantic matching to evaluate summaries along different aspects. By decomposing the evaluation task into simpler subtasks, this facet-aware approach offers a thorough evaluation of abstracts. Recognizing the absence of an evaluation benchmark in this domain, we curate a Facet-based scientific summarization Dataset (FD) with facet-level annotations. Our findings confirm that FM offers a more logical approach to evaluating scientific summaries. In addition, fine-tuned smaller models can compete with LLMs in scientific contexts, while LLMs remain limited in learning from in-context information in scientific domains, suggesting an area for future enhancement of LLMs.
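As a rough illustration of the facet-aware idea described above (not the paper's exact procedure), the sketch below decomposes evaluation into one semantic-matching subtask per facet and asks an LLM judge whether a candidate summary covers each facet of the reference abstract. The facet names, prompt wording, aggregation rule, and the `llm_judge` hook are all illustrative assumptions, not FM's actual implementation.

```python
# Minimal sketch of facet-aware summary evaluation. The four facets,
# the prompt, and the aggregation below are assumed for illustration;
# the actual FM facets and prompts may differ.
from typing import Callable, Dict

# Hypothetical facets a scientific abstract might be decomposed into.
FACETS = ["purpose", "method", "findings", "conclusion"]

PROMPT = (
    "Reference abstract:\n{reference}\n\n"
    "Candidate summary:\n{candidate}\n\n"
    "Does the candidate correctly cover the {facet} of the reference? "
    "Answer 'yes' or 'no' and give a one-sentence explanation."
)

def facet_scores(
    reference: str,
    candidate: str,
    llm_judge: Callable[[str], str],  # any text-in/text-out LLM call
) -> Dict[str, bool]:
    """Run one semantic-matching subtask per facet and record the verdict."""
    scores = {}
    for facet in FACETS:
        reply = llm_judge(
            PROMPT.format(reference=reference, candidate=candidate, facet=facet)
        )
        scores[facet] = reply.strip().lower().startswith("yes")
    return scores

def facet_metric(scores: Dict[str, bool]) -> float:
    """Aggregate per-facet judgments into a single score in [0, 1]."""
    return sum(scores.values()) / len(scores)

if __name__ == "__main__":
    # Stub judge for demonstration; replace with a real LLM API call.
    demo_judge = lambda prompt: "yes - the facet appears to be covered."
    s = facet_scores("reference abstract ...", "candidate summary ...", demo_judge)
    print(s, facet_metric(s))
```

One design point this sketch makes concrete: because each facet yields a separate yes/no judgment plus an explanation, the metric can localize *which* aspect of a summary fails, which the abstract identifies as a gap in $n$-gram, embedding, and QA-based methods.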