Scientific literature is increasingly siloed by complex language, static disciplinary structures, and potentially sparse keyword systems, making it cumbersome to capture the dynamic nature of modern science. This study addresses these challenges by introducing an adaptable large language model (LLM)-driven framework to quantify thematic trends and map the evolving landscape of scientific knowledge. The approach is demonstrated on a 20-year collection of more than 1,500 engineering articles published in the Proceedings of the National Academy of Sciences (PNAS), a corpus noted for the breadth and depth of its research focus. A two-stage classification pipeline first assigns each article a primary thematic category based on its abstract. The second stage analyzes the full text to assign secondary classifications, revealing latent cross-topic connections across the corpus. Traditional natural language processing (NLP) methods, such as Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), confirm the resulting topical structure and also suggest that standalone word-frequency analyses may be insufficient for mapping highly diverse fields. Finally, a disjoint graph representation linking the primary and secondary classifications reveals implicit connections between themes that may be less apparent when analyzing abstracts or keywords alone. The findings show that the approach independently recovers much of the journal's editorially embedded structure without prior knowledge of its existing dual-classification schema (e.g., biological studies also classified as engineering). The framework thus offers a tool for detecting potential thematic trends and for providing a high-level overview of scientific progress.
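As a rough illustration of the graph step described above, the following minimal sketch counts how often a primary category co-occurs with each secondary category and treats those counts as weighted edges between two separate sets of nodes. It is not the study's implementation: the category names, the `articles` list, and the `theme_edges` helper are all hypothetical, and the labels are assumed to have already been produced by some classifier.

```python
from collections import Counter

# Hypothetical labels: (primary category, secondary categories) per article.
# Category names are illustrative only and are not taken from the study.
articles = [
    ("materials", ["energy", "biomedical"]),
    ("biomedical", ["materials"]),
    ("materials", ["energy"]),
    ("fluids", []),
]


def theme_edges(labelled):
    """Count primary -> secondary links, one edge weight per category pair."""
    edges = Counter()
    for primary, secondaries in labelled:
        # Use a set so a label repeated within one article counts once.
        for secondary in set(secondaries):
            edges[(primary, secondary)] += 1
    return edges


edges = theme_edges(articles)
for (primary, secondary), weight in edges.most_common():
    print(f"{primary} -> {secondary}: {weight}")
```

Keeping the edge key as an ordered (primary, secondary) pair preserves the distinction between the two label sets, so a theme that appears mostly as a secondary label remains visible as such.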