This study applies three topic modelling techniques, Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA), to the Socrata dataset, which spans 1908 to 2009. With records categorized by operator type (military, commercial, and private), the analysis identified key themes such as pilot error, mechanical failure, weather conditions, and training deficiencies. The study highlights the distinct strengths of each method: LDA's ability to uncover overlapping themes, NMF's production of distinct and interpretable topics, and PLSA's nuanced probabilistic insights despite its interpretative complexity. In the quantitative evaluation, PLSA achieved a coherence score of 0.32 and a perplexity value of -4.6, NMF scored 0.34 and 37.1 respectively, and LDA achieved the highest coherence (0.36) but also the highest perplexity (38.2). These findings demonstrate the value of topic modelling for extracting actionable insights from unstructured aviation safety narratives, helping to identify risk factors and areas for improvement across sectors. Future directions include integrating additional contextual variables, leveraging neural topic models, and enhancing aviation safety protocols. This research provides a foundation for advanced text-mining applications in aviation safety management.
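The two evaluation metrics named above, coherence and perplexity, can be illustrated with a minimal sketch. It rests on several assumptions: the abstract does not say which coherence measure or perplexity definition was used, so the code shows one common variant of each (a UMass-style document co-occurrence coherence, averaged over word pairs, and perplexity as the exponential of the negative mean log-probability per token). The toy narratives and the helper names `umass_coherence` and `perplexity` are hypothetical and are not taken from the Socrata dataset or from the study.

```python
import math

# Toy, made-up narratives for illustration only (NOT from the Socrata dataset).
docs = [
    "pilot error during landing in fog",
    "engine failure after takeoff",
    "pilot lost control in bad weather",
    "engine fire caused mechanical failure",
]
tokenized = [d.split() for d in docs]


def umass_coherence(top_words, tokenized_docs):
    """UMass-style coherence, averaged over word pairs.

    For each pair (w_i, w_j) with j < i in the topic's ranked word list,
    score log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents
    containing all the given words. Assumes every word occurs in the corpus.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j)))
    return sum(scores) / len(scores)


def perplexity(token_probs):
    """Perplexity = exp(-mean log-probability) over held-out tokens.

    token_probs holds the probability a fitted model assigns to each
    observed token; lower perplexity means a better predictive fit.
    """
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


# Words that co-occur in the toy corpus score higher than words that never do.
coherent = umass_coherence(["engine", "failure"], tokenized)
incoherent = umass_coherence(["pilot", "engine"], tokenized)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four words.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Under these definitions, higher coherence indicates that a topic's top words tend to appear together in documents, while lower perplexity indicates a better predictive fit. As the figures reported above suggest, the two criteria need not favour the same model.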