Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models such as BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored to crisis-related texts. Likewise, sentence embeddings for such texts are typically generated with general-purpose sentence encoders, which do not account for the textual complexities of crisis-related texts. Advances in applications such as text classification, semantic search, and clustering contribute to the effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in the crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves on the state of the art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. All models are publicly released (https://huggingface.co/crisistransformers), in the expectation that they will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.