Mapping Scientific Literature with Large Language Models and Topic Modeling

Scientific literature is increasingly fragmented by disciplinary boundaries, specialized terminology, and often sparse keyword systems, making it difficult to capture the evolving structure of modern science. This study introduces a large language model (LLM)-driven framework for mapping scientific literature from a topic modeling perspective. The approach is demonstrated on a 20-year corpus of more than 1,500 engineering-related articles published in the Proceedings of the National Academy of Sciences (PNAS). A two-stage classification pipeline first assigns a primary thematic category to each article based on its abstract, then analyzes the full text to identify secondary classifications that reveal latent cross-topic connections within the corpus. Unlike conventional topic models, the LLM-based framework produces semantically interpretable topics while maintaining strong quantitative performance. Comparative evaluation against established topic modeling methods shows higher topic diversity and lower topic overlap, with competitive coherence scores. Manual validation on a randomly sampled subset of abstracts yields an accuracy of 75.9%. Additional analyses using traditional natural language processing methods confirm that the generated topics correspond to meaningful linguistic patterns in the corpus. A bipartite network linking primary and secondary classifications further reveals implicit thematic relationships that are not readily observable through abstracts or keyword systems alone. The findings indicate that the framework independently recovers much of the journal's editorial dual-classification structure without prior knowledge of its schema. Overall, the proposed approach offers a powerful tool for mapping science and identifying emerging cross-topic connections in research.
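
As a rough illustration of two ideas mentioned in the abstract, the sketch below counts weighted edges for a bipartite network of primary and secondary classifications and computes one common definition of topic diversity (the fraction of unique words among the top-k words of all topics). The category labels, the record layout, and the function names are hypothetical, and the study may define its network and its diversity metric differently.

```python
from collections import Counter


def bipartite_edges(articles):
    """Count how often each (primary, secondary) label pair occurs.

    Each article is assumed to be a dict with a single "primary" label
    and a list of "secondary" labels (a hypothetical record layout).
    """
    edges = Counter()
    for article in articles:
        # set() avoids double-counting a secondary label repeated in one article
        for secondary in set(article["secondary"]):
            edges[(article["primary"], secondary)] += 1
    return edges


def topic_diversity(topics, k=10):
    """Fraction of unique words among the top-k words of all topics."""
    top_words = [word for words in topics for word in words[:k]]
    return len(set(top_words)) / len(top_words) if top_words else 0.0


# Hypothetical example data, not taken from the study's corpus.
articles = [
    {"primary": "Materials", "secondary": ["Energy", "Biomedical"]},
    {"primary": "Materials", "secondary": ["Energy"]},
    {"primary": "Robotics", "secondary": ["Biomedical"]},
]
edges = bipartite_edges(articles)
diversity = topic_diversity([["cell", "tissue"], ["tissue", "polymer"]], k=2)
```

In this toy data, the edge weight between two categories is simply the number of articles carrying that primary and secondary pair; heavier edges would suggest stronger cross-topic links.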
