Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models

The relentless expansion of scientific literature presents significant challenges for navigation and knowledge discovery. Within Research Information Retrieval, established tasks such as text summarization and classification remain crucial for helping researchers and practitioners navigate this vast landscape, and efforts have increasingly focused on developing advanced research information systems. These systems aim not only to provide standard keyword-based search but also to support automatic content categorization in knowledge-intensive organizations across academia and industry. This study systematically evaluates the performance of off-the-shelf Large Language Models (LLMs) in analyzing scientific texts according to a given classification scheme. We used the hierarchical ORKG taxonomy as the classification framework and the FORC dataset as ground truth. We investigated two advanced prompt engineering strategies, In-Context Learning (ICL) and Prompt Chaining, and experimentally explored the influence of the LLMs' temperature hyperparameter on classification accuracy. Our experiments show that Prompt Chaining yields higher classification accuracy than pure ICL, particularly when applied to the nested structure of the ORKG taxonomy. LLMs with Prompt Chaining outperform state-of-the-art models for domain (1st level) prediction and show even better performance than the older BERT model for subject (2nd level) prediction. However, LLMs do not yet perform well at classifying the topic (3rd level) of research areas in this specific hierarchical taxonomy, reaching only about 50% accuracy even with Prompt Chaining.
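To make the idea of Prompt Chaining over a nested taxonomy concrete, the following is a minimal sketch, not the study's actual implementation. It assumes the model is available as a plain callable `llm(prompt) -> str`; the taxonomy excerpt, the prompt wording, the function names, and the fallback rule are all hypothetical and chosen only for illustration. Each step restricts the options to the children of the label chosen in the previous step, so the model never sees the full label set at once.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical three-level excerpt (domain -> subject -> topic).
# The labels are illustrative and are NOT copied from the ORKG taxonomy.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Physical Sciences": {
        "Physics": ["Optics", "Quantum Physics"],
        "Chemistry": ["Organic Chemistry", "Polymer Chemistry"],
    },
    "Life Sciences": {
        "Biology": ["Genetics", "Ecology"],
    },
}


def build_prompt(text: str, level: str, options: List[str]) -> str:
    """Assemble one link of the chain: a single-choice question for one level."""
    lines = [f"Classify the scientific text into exactly one {level}.", "Options:"]
    lines += [f"- {option}" for option in options]
    lines += [f"Text: {text}", "Answer with the option name only."]
    return "\n".join(lines)


def pick(llm: Callable[[str], str], text: str, level: str, options: List[str]) -> str:
    """Ask the model for one label and map its answer back onto a valid option."""
    answer = llm(build_prompt(text, level, options)).strip().lower()
    for option in options:
        if option.lower() == answer:
            return option
    # Assumption: fall back to the first option when the answer is not a valid label.
    return options[0]


def classify_chain(llm: Callable[[str], str], text: str) -> Tuple[str, str, str]:
    """Run three chained prompts, each narrowed by the previous answer."""
    domain = pick(llm, text, "domain", list(TAXONOMY))
    subject = pick(llm, text, "subject", list(TAXONOMY[domain]))
    topic = pick(llm, text, "topic", TAXONOMY[domain][subject])
    return domain, subject, topic


# Usage with a scripted stand-in for a real model (no API call is made).
scripted = iter(["Physical Sciences", "Chemistry", "Polymer Chemistry"])
print(classify_chain(lambda prompt: next(scripted), "A study of polymer synthesis."))
```

A pure ICL setup would instead place all labels and a few labeled examples in a single prompt; the chained variant trades extra model calls for a much smaller option list at each level.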
