An important aspect of text mining is information retrieval in the form of discovering semantic themes (topics) in documents using topic modelling. While generative topic models such as Latent Dirichlet Allocation (LDA) elegantly model topics as probability distributions and are useful for identifying latent topics in large document corpora with minimal supervision, they suffer from poor topic interpretability and reduced performance on shorter texts. Here we propose a novel Multivariate Gaussian Topic modelling (MGD) approach, in which topics are represented as Multivariate Gaussian Distributions and documents as Gaussian Mixture Models. The constituent Multivariate Gaussian Distributions and their parameters are estimated using the EM algorithm. Analysis of these parameters identifies the keywords with the highest variance and mean contributions to each topic, and the topics are annotated from these keywords. The approach is first applied to a synthetic dataset to demonstrate its interpretability benefits over LDA. A real-world application is then demonstrated in the analysis of risks and hazards at a petrochemical plant, where the model is applied to safety incident reports to identify the major latent hazards at the plant. The model achieves a higher mean topic coherence of 0.436, compared with 0.294 for LDA.
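The pipeline described above (Gaussian topics, EM estimation, keyword ranking from the fitted parameters) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each document is already encoded as a vector of term weights over a fixed vocabulary, fits a full-covariance Gaussian mixture with plain EM, and ranks keywords by each component's mean and by the diagonal of its covariance. The vocabulary, the synthetic data, and the helper names `fit_gmm` and `top_keywords` are all hypothetical.

```python
import numpy as np


def fit_gmm(X, k, n_iter=30, reg=1e-3):
    """Fit a k-component full-covariance Gaussian mixture to the rows of X with EM."""
    n, d = X.shape
    # Farthest-point initialisation (an assumption of this sketch, chosen for determinism).
    means = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - m) ** 2, axis=1) for m in means], axis=0)
        means.append(X[np.argmax(dist)])
    means = np.array(means)
    # Start from spherical covariances scaled to the average per-term variance.
    covs = np.array([np.eye(d) * X.var(axis=0).mean() for _ in range(k)])
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: log of (mixing weight * Gaussian density) for every document and topic.
        log_r = np.empty((n, k))
        for j in range(k):
            chol = np.linalg.cholesky(covs[j])
            z = np.linalg.solve(chol, (X - means[j]).T)
            maha = np.sum(z ** 2, axis=0)  # squared Mahalanobis distances
            log_det = 2.0 * np.sum(np.log(np.diag(chol)))
            log_r[:, j] = np.log(weights[j]) - 0.5 * (d * np.log(2 * np.pi) + log_det + maha)
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilise before exponentiating
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means and covariances from the responsibilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs, resp


def top_keywords(vocab, means, covs, j, n=3):
    """Rank vocabulary terms for topic j by mean and by variance (covariance diagonal)."""
    by_mean = [vocab[i] for i in np.argsort(means[j])[::-1][:n]]
    by_var = [vocab[i] for i in np.argsort(np.diag(covs[j]))[::-1][:n]]
    return by_mean, by_var


# Hypothetical synthetic corpus: two well-separated themes over six invented terms.
rng = np.random.default_rng(0)
vocab = ["leak", "valve", "pressure", "fall", "ladder", "scaffold"]
leak_docs = rng.normal([3, 3, 3, 0, 0, 0], 0.3, size=(40, 6))
fall_docs = rng.normal([0, 0, 0, 3, 3, 3], 0.3, size=(40, 6))
X = np.vstack([leak_docs, fall_docs])

weights, means, covs, resp = fit_gmm(X, k=2)
labels = resp.argmax(axis=1)  # most responsible topic per document

for j in range(2):
    by_mean, by_var = top_keywords(vocab, means, covs, j)
    print(f"topic {j}: mean-ranked={by_mean}, variance-ranked={by_var}")
```

In this toy setting the mean-ranked keywords recover the two planted themes. Real incident reports would need a text-to-vector step and a principled choice of the number of topics, neither of which is specified in the abstract.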