Social platforms have become a crucial medium for disseminating and discussing information about real-life events, which offers an excellent opportunity for early detection of newsworthy events. However, most existing event detection approaches exploit only keyword burstiness or network structure to detect hot events. As a result, they often fail to identify emerging social events before they reach a trending state, owing to the challenging nature of events and social data. Social data, e.g., tweets, is characterized by misspellings, incompleteness, ambiguity, and irregular language, as well as varied aspects of opinion. Moreover, learning the evolving characteristics of events from limited contextual knowledge is almost infeasible for machine learning models. To address these problems, we propose a framework that exploits the lexical, semantic, and contextual representations of streaming social data. In particular, we leverage contextual knowledge to detect semantically related tweets at their earliest emergence and to enhance the quality of the produced clusters. We then produce a cluster chain for each event to show how the event evolves over time. We conducted extensive experiments to evaluate our framework, and the results validate its effectiveness in detecting and distinguishing social events.