With the rapid expansion of academic literature and the proliferation of preprints, researchers face growing challenges in manually organizing and labeling large volumes of articles. The NSLP 2024 FoRC Shared Task I addresses this challenge in the form of a competition: the goal is to develop a classifier that predicts, for a given article, one of 123 predefined classes from the Open Research Knowledge Graph (ORKG) taxonomy of research fields. This paper presents our results. We first enrich the dataset (containing English scholarly articles sourced from ORKG and arXiv), then leverage different pre-trained language models (PLMs), specifically BERT, and explore their efficacy in transfer learning for this downstream task. Our experiments encompass feature-based and fine-tuned transfer learning approaches using diverse PLMs optimized for scientific tasks, including SciBERT, SciNCL, and SPECTER2. We conduct hyperparameter tuning and investigate the impact of data augmentation from bibliographic databases such as OpenAlex, Semantic Scholar, and Crossref. Our results demonstrate that fine-tuning pre-trained models substantially enhances classification performance, with SPECTER2 emerging as the most accurate model. Moreover, enriching the dataset with additional metadata improves classification outcomes significantly, especially when integrating information from S2AG, OpenAlex, and Crossref. Our best-performing approach achieves a weighted F1-score of 0.7415. Overall, our study contributes to the advancement of reliable automated systems for scholarly publication categorization, offering a potential solution to the laborious manual curation process and thereby helping researchers locate relevant resources efficiently.
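To make the fine-tuning setup concrete, the sketch below shows one plausible way to fine-tune a SPECTER2-style PLM for the 123-class ORKG field classification using the Hugging Face transformers and datasets libraries. This is a minimal illustration, not the authors' exact pipeline: the checkpoint name `allenai/specter2_base`, the title/abstract input format, the toy record, and the hyperparameters are all assumptions.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline) of
# fine-tuning a scientific PLM for 123-class research-field classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_CLASSES = 123  # size of the ORKG research-field taxonomy

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/specter2_base", num_labels=NUM_CLASSES)

# Toy stand-in for the enriched title+abstract records; the real shared-task
# data comes from ORKG/arXiv, augmented with S2AG, OpenAlex, and Crossref
# metadata. The label index here is hypothetical.
data = Dataset.from_dict({
    "title": ["Attention Is All You Need"],
    "abstract": ["The dominant sequence transduction models are based on ..."],
    "label": [42],
})

def tokenize(batch):
    # SPECTER-style input: title and abstract joined by the separator token.
    texts = [t + tokenizer.sep_token + a
             for t, a in zip(batch["title"], batch["abstract"])]
    return tokenizer(texts, truncation=True, max_length=512)

train_ds = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="forc_model", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds,
        tokenizer=tokenizer).train()
```

The same scaffold works for the other PLMs compared in the paper (e.g., SciBERT or SciNCL) by swapping the checkpoint name; the feature-based variant would instead freeze the encoder and train only a classification head.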