This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologically rich language. We applied BERTopic with three multilingual embedding models on two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and NMF on fully preprocessed text. The experiments were conducted on a dataset of tweets expressing hesitancy toward COVID-19 vaccination. Our results show that, with appropriate parameter settings, BERTopic can yield informative topics even when applied to partially preprocessed short text. When the same parameters are applied in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. Compared to LDA and NMF, judging by the keywords, BERTopic offers more informative topics and gives novel insights when the number of topics is not limited. The findings of this paper may be significant for researchers working with other morphologically rich, low-resource languages and with short text.