Topic modeling is a key method in text analysis, but existing approaches either assume a single topic per document or fail to scale efficiently to large, noisy datasets of short texts. We introduce Semantic Component Analysis (SCA), a novel topic modeling technique that overcomes these limitations by discovering multiple, nuanced semantic components beyond a single topic in short texts, which we accomplish by introducing a decomposition step into the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. It achieves competitive coherence and diversity compared to BERTopic, while uncovering at least twice as many semantic components and maintaining a noise rate close to zero. Furthermore, SCA is scalable and effective across languages, including an underrepresented one.
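To make the contrast concrete, the following is a minimal sketch of the difference between hard clustering (one topic per document, as in clustering-based pipelines like BERTopic) and a decomposition step that yields multiple component loadings per document. The random stand-in embeddings, the component count, and the use of FastICA here are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Stand-in for sentence embeddings of short texts (e.g. tweets);
# a real pipeline would use a multilingual sentence encoder.
X = rng.normal(size=(200, 32))

# Baseline: hard clustering assigns exactly one "topic" label per document.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Decomposition step (illustrative): each document instead receives
# continuous loadings on several semantic components at once.
loadings = FastICA(n_components=10, random_state=0).fit_transform(X)

print(labels.shape)    # one label per document
print(loadings.shape)  # ten component loadings per document
```

The key point of the sketch is the output shape: the clustering baseline produces a single label per document, while the decomposition yields a loading vector, allowing one short text to participate in several semantic components.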