Capturing research literature attitude towards Sustainable Development Goals: an LLM-based topic modeling approach

The world is facing a multitude of challenges that hinder the development of human civilization and the well-being of humanity on the planet. The Sustainable Development Goals (SDGs) were formulated by the United Nations in 2015 to address these global challenges by 2030. Natural language processing techniques can help uncover discussions on SDGs within research literature. We propose a fully automated pipeline to 1) fetch content from the Scopus database and prepare datasets dedicated to five groups of SDGs; 2) perform topic modeling, a statistical technique used to identify topics in large collections of textual data; and 3) enable topic exploration through keyword-based search and topic frequency time series extraction. For topic modeling, we leverage the BERTopic stack, scaled up to handle large corpora of textual documents (we find hundreds of topics across hundreds of thousands of documents), introducing i) a novel LLM-based embedding computation for representing scientific abstracts in a continuous space and ii) a hyperparameter optimizer to efficiently find the best configuration for any new large dataset. We additionally visualize the results on interactive dashboards that report the topics' temporal evolution. Results are made inspectable and explorable, contributing to the interpretability of the topic modeling process. Our proposed LLM-based topic modeling pipeline for big-text datasets allows users to capture insights on the evolution of the attitude toward SDGs within scientific abstracts over the 2006-2023 time span. All the results are reproducible by using our system; the workflow can be generalized to be applied at any point in time to any big corpus of textual documents.
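The topic exploration step (keyword-based search and topic frequency time series extraction) can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: the toy data, the names `assignments` and `topic_keywords`, and the functions `search_topics` and `topic_frequency_series` are all hypothetical, and the sketch assumes that a topic model has already assigned one topic to each abstract and produced a keyword list per topic.

```python
from collections import Counter

# Hypothetical toy output of a topic model: one (publication year, topic id)
# pair per abstract. Real data would come from the fitted model.
assignments = [
    (2006, 0), (2006, 1),
    (2007, 0), (2007, 0),
    (2008, 1), (2008, 2),
]

# Hypothetical keyword representation of each topic.
topic_keywords = {
    0: ["renewable", "energy", "solar"],
    1: ["poverty", "income", "inequality"],
    2: ["water", "sanitation", "energy"],
}


def search_topics(query, topic_keywords):
    """Return sorted ids of topics whose keyword list contains the query."""
    q = query.lower()
    return sorted(
        topic_id
        for topic_id, keywords in topic_keywords.items()
        if q in (keyword.lower() for keyword in keywords)
    )


def topic_frequency_series(assignments, topic_id, normalize=True):
    """Return {year: frequency} for one topic.

    With normalize=True the frequency is the share of that year's
    abstracts assigned to the topic; otherwise it is the raw count.
    """
    totals = Counter(year for year, _ in assignments)
    hits = Counter(year for year, topic in assignments if topic == topic_id)
    series = {}
    for year in sorted(totals):
        count = hits.get(year, 0)
        series[year] = count / totals[year] if normalize else count
    return series


if __name__ == "__main__":
    print(search_topics("energy", topic_keywords))
    print(topic_frequency_series(assignments, 0))
```

Normalizing by the number of abstracts per year is one possible design choice: it keeps the series comparable across years even when the yearly publication volume grows.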
