The scholarly publication space is growing steadily, not only in volume but also in complexity, as researchers collaborate within and across fields. This paper presents a hierarchical classification system that automatically assigns a scholarly publication, based on its abstract, to a three-tier hierarchical label set (discipline-field-subfield). The system enables a holistic view of the interdependence of research activities across these tiers, in terms of knowledge production through articles and impact through citations. The classification system (44 disciplines, 738 fields, 1,501 subfields) uses and scales to 160 million abstract snippets from Microsoft Academic Graph (version 2018-05-17), relying on batch training in a modularized and distributed fashion to address and assess interdisciplinary and inter-field classifications. In addition, we explore multi-class classification in both single-label and multi-label settings. In total, we conducted 3,140 experiments; across all models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers), classification accuracy exceeds 90% in 77.84% of the single-label and 78.83% of the multi-label classifications. We examine the advantages of our classification in terms of its ability to better align research texts and output with disciplines, to classify them adequately in an automated way, and to capture the degree of interdisciplinarity of a publication, which enables downstream analytics such as field interdisciplinarity. This system (a set of pretrained models) can serve as the backbone of an interactive system for indexing scientific publications.
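To make the three-tier, modular idea concrete, here is a minimal sketch of top-down routing with one classifier per parent node. It is an illustration under stated assumptions, not the paper's implementation: the keyword scorer stands in for the trained CNN/RNN/Transformer models, and the label names, the `MODELS` table, the `classify_paths` function, and the `threshold` parameter are all hypothetical. Keeping every child label whose score reaches the threshold (rather than only the top one) mimics the multi-label setting, so a single abstract can follow several discipline-field-subfield paths.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A node classifier maps an abstract to {child_label: score}.
Classifier = Callable[[str], Dict[str, float]]


def keyword_classifier(keywords: Dict[str, List[str]]) -> Classifier:
    """Toy stand-in for a trained model: score = share of keyword hits per label."""
    def classify(text: str) -> Dict[str, float]:
        # Naive whitespace tokenization; punctuation is not stripped.
        tokens = Counter(text.lower().split())
        hits = {label: sum(tokens[w] for w in words)
                for label, words in keywords.items()}
        total = sum(hits.values())
        return {label: (h / total if total else 0.0) for label, h in hits.items()}
    return classify


# One classifier per parent node; "ROOT" chooses the discipline.
# All labels and keywords below are hypothetical examples.
MODELS: Dict[str, Classifier] = {
    "ROOT": keyword_classifier({
        "computer science": ["neural", "algorithm", "classification"],
        "biology": ["protein", "gene", "cell"],
    }),
    "computer science": keyword_classifier({
        "machine learning": ["neural", "classification", "training"],
        "databases": ["query", "index", "transaction"],
    }),
    "machine learning": keyword_classifier({
        "deep learning": ["neural", "convolutional", "transformer"],
        "kernel methods": ["kernel", "margin"],
    }),
    "biology": keyword_classifier({
        "genetics": ["gene", "genome"],
        "cell biology": ["cell", "membrane"],
    }),
}


def classify_paths(text: str, threshold: float = 0.3, depth: int = 3) -> List[List[str]]:
    """Return every label path whose score reaches `threshold` at each tier."""
    frontier: List[Tuple[str, List[str]]] = [("ROOT", [])]
    complete: List[List[str]] = []
    for _ in range(depth):
        next_frontier: List[Tuple[str, List[str]]] = []
        for node, path in frontier:
            model = MODELS.get(node)
            if model is None:
                complete.append(path)  # no deeper model: keep the partial path
                continue
            for label, score in model(text).items():
                if score >= threshold:
                    next_frontier.append((label, path + [label]))
        frontier = next_frontier
    complete.extend(path for _, path in frontier)
    return complete
```

Calling `classify_paths("a convolutional neural network for gene classification")` in this toy setup yields two paths, one under computer science and one under biology. The number of distinct disciplines among the returned paths is one simple, assumed proxy for the degree of interdisciplinarity the abstract mentions.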