This study presents an ensemble approach to identifying and analyzing research articles in rapidly evolving fields, using Artificial Intelligence (AI) as a case study. Our approach applies a decision tree, SciBERT, and regular expression matching to different fields of each article, and uses a support vector machine (SVM) to merge the outputs of these models. We evaluated the method on a manually labeled dataset and found that the combined approach captured around 97% of AI-related articles in the Web of Science (WoS) corpus, at a precision of 0.92. This represents a 0.15 increase in F1 score over an existing search-term-based approach. We then analyzed publication volume trends and common research themes. Compared with existing methods, our ensemble approach revealed a higher degree of interdisciplinarity and identified more articles in certain subfields, such as feature extraction and optimization. This study demonstrates the potential of our approach as a tool for accurately identifying scholarly articles and for providing insights into the volume and content of a research area.
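To make the reported figures easier to relate to one another, the following is a minimal sketch of how an F1 score follows from precision and recall. It assumes that "captured around 97% of AI-related articles" corresponds to a recall of roughly 0.97, and that the 0.15 improvement is an absolute difference in F1. Both are interpretations of the abstract, not values stated in it, and the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract; recall is "around 97%", so it is approximate.
ensemble_f1 = f1_score(precision=0.92, recall=0.97)

# Assumption: the reported 0.15 gain is an absolute difference in F1,
# which would place the search-term baseline at about ensemble_f1 - 0.15.
implied_baseline_f1 = ensemble_f1 - 0.15

print(f"ensemble F1 (approx.): {ensemble_f1:.3f}")
print(f"implied baseline F1 (approx.): {implied_baseline_f1:.3f}")
```

Under these assumptions the ensemble's F1 comes out near 0.94, which would put the search-term baseline near 0.79. The exact values depend on the unrounded precision and recall.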