The news landscape evolves continuously, and the volume of information published worldwide keeps growing. Automated event detection over this body of data is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that combines Large Language Models (LLMs) with clustering analysis to detect news events in the Global Database of Events, Language, and Tone (GDELT). The framework supports event clustering with pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labelling). We also evaluate how different textual embeddings affect the quality of the clustering outcomes, with the aim of robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI uses multiple feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that using LLM embeddings in the event detection framework significantly improves the results and yields greater robustness in terms of CSAI scores. Moreover, the post-event detection tasks generate meaningful insights that make the event clusters easier to interpret. Overall, our experimental results indicate that the proposed framework offers valuable insights and could improve the accuracy of news analysis and reporting.
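To make the embed-then-cluster idea concrete, the following is a minimal sketch under stated assumptions, not the framework's implementation. The four headlines are invented, the two toy embeddings (word counts and character trigrams) merely stand in for the textual embeddings being compared, and all function names are hypothetical. The abstract does not define CSAI, so the agreement score used here is a plain Rand index between the two clusterings, not the proposed index.

```python
import math
from collections import Counter

# Invented toy headlines covering two events (illustration only).
docs = [
    "earthquake strikes coastal city",
    "coastal city earthquake damage",
    "central bank raises interest rates",
    "interest rates raised by central bank",
]

def words(text):
    return text.lower().split()

def char_trigrams(text):
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

def embed_all(texts, tokenize):
    """Toy stand-in for a text embedding: L2-normalised token counts."""
    vocab = sorted({tok for t in texts for tok in tokenize(t)})
    vectors = []
    for t in texts:
        counts = Counter(tokenize(t))
        vec = [float(counts[tok]) for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def rand_index(a, b):
    """Fraction of document pairs on which two clusterings agree."""
    n = len(a)
    pairs = agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            agree += (a[i] == a[j]) == (b[i] == b[j])
    return agree / pairs

word_labels = kmeans(embed_all(docs, words), k=2)
trigram_labels = kmeans(embed_all(docs, char_trigrams), k=2)
agreement = rand_index(word_labels, trigram_labels)

print("word-count clusters:", word_labels)
print("trigram clusters:   ", trigram_labels)
print("agreement (Rand index):", agreement)
```

A high agreement score suggests that the grouping of headlines does not depend strongly on which embedding is used. In a realistic setting the toy embeddings would be replaced by actual embedding models and the clustering by whichever algorithm the study adopts.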