Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. All models are publicly released (https://huggingface.co/crisistransformers), with the anticipation that they will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.
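To make the semantic search use case mentioned above concrete, the following is a minimal sketch of how sentence embeddings might be compared once an encoder has produced them. It is not the authors' evaluation code. The three-dimensional vectors, the example texts, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical placeholders; real embeddings from a sentence encoder would have hundreds of dimensions and would come from the model itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs for the corpus, most similar to the query first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical texts and toy embeddings, invented for illustration only.
corpus = [
    "Bridge collapsed near the river, people trapped",
    "Flood water rising on Main Street",
    "Great pizza downtown tonight",
]
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of a crisis-related query

ranking = rank_by_similarity(query_vec, corpus_vecs)
for index, score in ranking:
    print(f"{score:.3f}  {corpus[index]}")
```

In this toy setup the two crisis-related texts rank above the unrelated one because their vectors point in nearly the same direction as the query. A domain-specific encoder is meant to produce embeddings with this property for real crisis-related tweets.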