The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labeling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes latent feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that combining LLM embeddings with clustering algorithms yields the best results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy and depth of news reporting.
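The pipeline described above (text embedding, then clustering, then topic labeling) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the framework's actual implementation: a bag-of-words count vector stands in for an LLM embedding, a single-pass cosine-threshold clustering stands in for the clustering algorithms evaluated in the paper, and the most frequent cluster terms stand in for LLM-generated topic labels. The names `embed`, `cluster`, `label`, the `threshold` value, and the sample headlines are all hypothetical. CSAI is not reproduced here because the abstract does not define it.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for the example.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "for", "after", "as", "by", "with"}

# Made-up headlines used only to exercise the sketch (not GDELT data).
HEADLINES = [
    "Earthquake strikes coastal city, rescue teams deployed",
    "Central bank raises interest rates to curb inflation",
    "Rescue teams search for survivors after coastal earthquake",
    "Inflation fears grow as central bank raises interest rates again",
]

def embed(text):
    """Toy stand-in for an LLM embedding: a sparse bag-of-words count vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(texts, threshold=0.3):
    """Single-pass clustering: join the most similar centroid or open a new cluster."""
    centroids, members = [], []
    for i, text in enumerate(texts):
        vec = embed(text)
        scores = [cosine(vec, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best].update(vec)  # centroid is kept as summed term counts
            members[best].append(i)
        else:
            centroids.append(Counter(vec))
            members.append([i])
    return centroids, members

def label(centroid, k=3):
    """Toy topic label: the k most frequent terms in the cluster centroid."""
    return [term for term, _ in centroid.most_common(k)]

if __name__ == "__main__":
    centroids, members = cluster(HEADLINES)
    for centroid, idxs in zip(centroids, members):
        print(label(centroid), [HEADLINES[i] for i in idxs])
```

In this sketch the two earthquake headlines and the two interest-rate headlines end up in separate clusters. Swapping `embed` for a dense LLM embedding would leave the clustering and labeling steps structurally unchanged, which is the kind of embedding comparison the abstract describes.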