We propose a straightforward solution for detecting scarce topics in unbalanced short-text datasets. Our approach, named CWUTM (Topic model based on co-occurrence word networks for unbalanced short text datasets), addresses the challenge of sparse and unbalanced short-text topics by mitigating the effects of incidental word co-occurrence, which allows the model to prioritize the identification of scarce (low-frequency) topics. Unlike previous methods, CWUTM leverages co-occurrence word networks to capture the topic distribution of each word. It improves sensitivity to scarce topics by redefining how node activity is calculated and by normalizing, to some extent, the representation of both scarce and abundant topics. Moreover, CWUTM adopts Gibbs sampling, as LDA does, making it easily adaptable to various application scenarios. Extensive experiments on unbalanced short-text datasets show that CWUTM outperforms baseline approaches in discovering scarce topics. The experimental results indicate that the proposed model is effective for early and accurate detection of emerging topics or unexpected events on social platforms.
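To make the notion of a co-occurrence word network concrete, the following is a minimal sketch of how such a network might be built from tokenized short texts. It is an illustration under stated assumptions, not the CWUTM implementation: the function names `build_cooccurrence_network` and `node_degree`, the `min_count` threshold, and the toy documents are all hypothetical. In particular, dropping rare edges is only a simple stand-in for reducing incidental co-occurrence, and weighted degree is a generic measure, not the redefined node activity that CWUTM uses.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(docs, min_count=1):
    """Count unordered word pairs that appear together in a document.

    Returns a dict mapping (w1, w2) with w1 < w2 to the number of
    documents containing both words. Pairs seen fewer than `min_count`
    times are dropped (an assumed, simplistic filter for incidental
    co-occurrence; not the mechanism described for CWUTM).
    """
    edges = Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))  # each pair counted once per document
        for w1, w2 in combinations(vocab, 2):
            edges[(w1, w2)] += 1
    return {pair: c for pair, c in edges.items() if c >= min_count}


def node_degree(edges):
    """Weighted degree of each word: the sum of its edge weights."""
    degree = Counter()
    for (w1, w2), c in edges.items():
        degree[w1] += c
        degree[w2] += c
    return degree


# Toy unbalanced corpus: two "flood" texts (scarce) and three "match" texts.
docs = [
    ["flood", "rescue", "river"],
    ["flood", "river", "warning"],
    ["match", "goal", "team"],
    ["match", "team", "coach"],
    ["match", "goal", "coach"],
]

edges = build_cooccurrence_network(docs, min_count=2)
degree = node_degree(edges)
```

In this toy corpus the scarce topic survives the filter only through the single edge between "flood" and "river", while "match" accumulates a much higher weighted degree. This imbalance is the kind of effect a topic model for unbalanced data has to compensate for.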