This paper presents a hierarchical classification system that uses the abstract of a scholarly publication to assign it automatically to a three-tier hierarchical label set (discipline, field, subfield) in a multi-class setting. The system enables a holistic categorization of research activities within this hierarchy, covering both knowledge production through articles and impact through citations, and it permits those activities to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields and 1,485 subfields among 160 million abstract snippets in Microsoft Academic Graph (version 2018-05-17). We used batch training in a modularized and distributed fashion to allow for interdisciplinary and interfield classifications in single-label and multi-label settings. In total, we conducted 3,140 experiments across all considered models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers). Classification accuracy exceeds 90% in 77.13% of the single-label classifications and in 78.19% of the multi-label classifications. We assess the advantages of our classification in terms of its ability to better align research texts and output with disciplines, to classify them adequately in an automated way, and to capture the degree of interdisciplinarity. The proposed system (a set of pre-trained models) can serve as the backbone of a future interactive system for indexing scientific publications.
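To make the three-tier, multi-label idea concrete, the following is a minimal sketch of top-down routing through a discipline, field, subfield hierarchy. It is not the system described above: the keyword-based scorers stand in for trained models, and the names `keyword_classifier`, `classify_hierarchy`, the `threshold` parameter, and all labels in the example are hypothetical choices made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a classifier maps text to (label, score) pairs.
Classifier = Callable[[str], List[Tuple[str, float]]]
Path = Tuple[str, ...]


def keyword_classifier(keywords: Dict[str, List[str]]) -> Classifier:
    """Toy stand-in for a trained model.

    The score of a label is the fraction of its keywords found in the text.
    """
    def classify(text: str) -> List[Tuple[str, float]]:
        lowered = text.lower()
        return [
            (label, sum(word in lowered for word in words) / len(words))
            for label, words in keywords.items()
        ]
    return classify


def classify_hierarchy(
    text: str,
    root: Classifier,
    children: Dict[Path, Classifier],
    threshold: float = 0.5,
    depth: int = 3,
) -> List[Path]:
    """Return every label path whose score is >= threshold at each tier.

    Several paths may be returned, which is how one abstract can fall
    into more than one discipline, field, or subfield.
    """
    paths: List[Path] = []

    def descend(model: Classifier, prefix: Path) -> None:
        for label, score in model(text):
            if score < threshold:
                continue
            path = prefix + (label,)
            child = children.get(path)
            if len(path) == depth or child is None:
                paths.append(path)  # full path, or partial if no child model
            else:
                descend(child, path)

    descend(root, ())
    return paths


# Illustrative labels only; they are not taken from the paper's taxonomy.
root = keyword_classifier({
    "Computer Science": ["neural", "algorithm"],
    "Biology": ["protein", "cell"],
})
children = {
    ("Computer Science",): keyword_classifier(
        {"Machine Learning": ["neural", "train"]}),
    ("Computer Science", "Machine Learning"): keyword_classifier(
        {"Deep Learning": ["neural network"]}),
    ("Biology",): keyword_classifier({"Biochemistry": ["protein"]}),
    ("Biology", "Biochemistry"): keyword_classifier(
        {"Structural Biology": ["structure"]}),
}

abstract = "We train a neural network to predict protein structure."
paths = classify_hierarchy(abstract, root, children)
for path in paths:
    print(" > ".join(path))
```

Keeping one small model per parent node mirrors the modular idea in spirit: each node can be trained and replaced independently, and an abstract that passes the threshold under several parents is reported under each of them.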