Topic modeling has become a prominent tool for the study of scientific fields, as it allows for large-scale interpretation of research trends. Nevertheless, the output of these models is structured as lists of keywords, which require manual interpretation for labelling. This paper assesses the reliability of three LLMs, namely flan, GPT-4o, and GPT-4 mini, for topic labelling. Drawing on previous research leveraging BERTopic, we generate topics from a dataset of all scientific articles (n=34,797) authored by all biology professors in Switzerland (n=465) between 2008 and 2020, as recorded in the Web of Science database. We assess the output of the three models both quantitatively and qualitatively and find, first, that both GPT models can accurately and precisely label topics from the models' output keywords and, second, that 3-word labels are preferable for grasping the complexity of research topics.