A Named Entity Recognition and Topic Modeling-based Solution for Locating and Better Assessment of Natural Disasters in Social Media

Over the last decade, social media content has proven very effective in disaster informatics, as it has in other application domains. However, the unstructured nature of the data poses several challenges for disaster analysis. To fully exploit the potential of social media content in disaster informatics, access to relevant content and correct geo-location information is critical. In this paper, we propose a three-step solution to tackle these challenges. First, the proposed solution classifies social media posts as relevant or irrelevant. Second, it automatically extracts location information from the text of the posts through Named Entity Recognition (NER). Finally, to quickly analyze the topics covered in large volumes of social media posts, we perform topic modeling, which yields a list of top keywords that highlight the issues discussed in the tweets. For the Relevant Classification of Twitter Posts (RCTP), we propose a merit-based fusion framework that combines the capabilities of four models, namely BERT, RoBERTa, DistilBERT, and ALBERT, obtaining the highest F1-score of 0.933 on a benchmark dataset. For the Location Extraction from Twitter Text (LETT), we evaluate four models, namely BERT, RoBERTa, DistilBERT, and ELECTRA, in an NER framework, obtaining the highest F1-score of 0.960. For topic modeling, we use the BERTopic library to discover hidden topic patterns in the relevant tweets. The experimental results for all components of the proposed end-to-end solution are very encouraging and hint at the potential of social media content and NLP in disaster management.
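
The abstract does not spell out how the merit-based fusion for RCTP is computed. As a minimal sketch of one plausible reading, the snippet below assumes a late-fusion scheme in which each model's probability of the "relevant" class is weighted by that model's validation F1-score. The function names, the weighting rule, the 0.5 threshold, and all numbers are illustrative placeholders, not values or methods taken from the paper.

```python
from typing import Dict, List


def merit_weights(val_f1: Dict[str, float]) -> Dict[str, float]:
    """Normalise per-model validation F1-scores into weights that sum to 1.

    Assumption: "merit" is taken to mean validation F1; the paper may
    define it differently.
    """
    total = sum(val_f1.values())
    if total <= 0:
        raise ValueError("at least one model needs a positive score")
    return {name: score / total for name, score in val_f1.items()}


def fuse(probs: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Return the weighted average of each model's P(relevant) per post."""
    n_posts = len(next(iter(probs.values())))
    fused = [0.0] * n_posts
    for name, model_probs in probs.items():
        if len(model_probs) != n_posts:
            raise ValueError(f"{name} scored a different number of posts")
        for i, p in enumerate(model_probs):
            fused[i] += weights[name] * p
    return fused


def classify(fused: List[float], threshold: float = 0.5) -> List[bool]:
    """Label a post as relevant when its fused probability meets the threshold."""
    return [p >= threshold for p in fused]


# Placeholder validation scores (hypothetical, chosen for easy arithmetic).
val_f1 = {"bert": 0.9, "roberta": 0.8, "distilbert": 0.7, "albert": 0.6}

# Placeholder P(relevant) for two posts from each fine-tuned classifier.
probs = {
    "bert": [0.9, 0.2],
    "roberta": [0.8, 0.3],
    "distilbert": [0.7, 0.6],
    "albert": [0.6, 0.7],
}

weights = merit_weights(val_f1)
fused = fuse(probs, weights)
labels = classify(fused)
```

In this sketch the second post is scored above 0.5 by the two weaker models, yet the fused probability falls below the threshold because the stronger models carry more weight. That is the intended effect of weighting by merit rather than averaging uniformly.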
