Sequential sentence classification (SSC) in scientific publications is crucial for supporting downstream tasks such as fine-grained information retrieval and extractive summarization. However, current SSC methods are constrained by model size, sequence length, and the single-label setting. To address these limitations, this paper proposes LLM-SSC, a large language model (LLM)-based framework for both single- and multi-label SSC tasks. Unlike previous approaches that employ small- or medium-sized language models, the proposed framework uses LLMs to generate SSC labels through designed prompts that enhance task understanding by incorporating demonstrations and a query describing the prediction target. We also present a multi-label contrastive learning loss with an auto-weighting scheme, enabling multi-label classification. To support our multi-label SSC analysis, we introduce and release a new dataset, biorc800, which mainly contains unstructured abstracts in the biomedical domain with manual annotations. Experiments demonstrate LLM-SSC's strong performance under both in-context learning and task-specific tuning settings. We release biorc800 and our code at: https://github.com/ScienceNLP-Lab/LLM-SSC.
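To make the auto-weighted multi-label contrastive loss concrete, the sketch below shows one plausible formulation, not the paper's exact one: pairs sharing at least one label act as positives, and each positive pair is auto-weighted by the Jaccard overlap of the two label sets. The function name, the Jaccard weighting, and the temperature value are illustrative assumptions.

```python
import numpy as np

def multilabel_contrastive_loss(embeddings, labels, temperature=0.1):
    """Illustrative multi-label contrastive loss (assumed form, not the
    paper's exact formulation). `embeddings` is (n, d); `labels` is an
    (n, k) binary multi-label matrix. Positive pairs share at least one
    label; each pair is auto-weighted by its label-set Jaccard overlap."""
    n = embeddings.shape[0]
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    loss, count = 0.0, 0
    for i in range(n):
        # Softmax denominator over all samples except the anchor itself
        mask = np.arange(n) != i
        log_denom = np.log(np.exp(sim[i][mask]).sum())
        for j in range(n):
            if i == j:
                continue
            inter = np.logical_and(labels[i], labels[j]).sum()
            union = np.logical_or(labels[i], labels[j]).sum()
            if inter == 0:
                continue  # no shared label: not a positive pair
            w = inter / union  # auto-weight: Jaccard overlap of label sets
            loss += -w * (sim[i, j] - log_denom)  # weighted log-softmax term
            count += 1
    return loss / max(count, 1)
```

Because each term is a weighted negative log-softmax, the loss is non-negative and pulls together sentences whose label sets overlap more strongly.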