A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream

Topic detection is a complex, language-dependent process because it requires analyzing text. Few studies have addressed topic detection in Persian, and the existing algorithms are unremarkable. We therefore study topic detection in Persian with three objectives: 1) to conduct an extensive study of the best algorithms for topic detection, 2) to identify the adaptations needed to make these algorithms suitable for the Persian language, and 3) to evaluate their performance on Persian social network texts. To achieve these objectives, we formulate two research questions. First, given the lack of research in Persian, what modifications should be made to existing frameworks, especially those developed for English, to make them compatible with Persian? Second, how do these algorithms perform, and which one is superior? Topic detection methods fall into various categories. We select the frequent pattern and clustering categories for this research and propose a hybrid of the two as a new category. We then select ten methods from these three categories, re-implement all of them from scratch, and modify and adapt them to Persian. These ten methods cover different types of topic detection methods and have shown good performance in English. The dataset consists of the text of Persian social network posts. Additionally, a new multiclass evaluation criterion, called FS, is used in this paper for the first time in the field of topic detection. Approximately 1.4 billion tokens are processed during the experiments. The results indicate that the hybrid category is better when the goal is to find keyword topics that are easily understandable by humans, whereas the frequent pattern category is more suitable when the aim is to cluster posts for further analysis.
