Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, and extensions of existing methodologies. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges in two ways: 1) it introduces a novel, unambiguous annotation schema explicitly designed for reliable automatic processing, and 2) it presents a comprehensive evaluation of a wide range of large language models (LLMs) on the task of classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments reveal that modern LLMs achieve strong results on this task when fine-tuned on high-quality data, surpassing 96% F1, with both large proprietary models such as GPT-4o and lightweight open-source alternatives performing well. Moreover, augmenting the training set with semi-synthetic LLM-generated examples further boosts performance, enabling small encoders to achieve robust results and substantially improving several open decoder models.