Scientific publications follow conventionalized rhetorical structures. Classifying the Argumentative Zone (AZ) of a sentence, e.g., identifying whether it states a Motivation, a Result, or Background information, has been proposed as a way to improve the processing of scholarly documents. In this work, we adapt and extend this idea to the domain of materials science research. We present and release a new dataset of 50 manually annotated research articles. The dataset spans seven sub-topics and is annotated with a multi-label AZ annotation scheme tailored to materials science. We detail corpus statistics and demonstrate high inter-annotator agreement. Our computational experiments show that using domain-specific pre-trained transformer-based text encoders is key to high classification performance. We also find that AZ categories from existing datasets in other domains are transferable to varying degrees.