CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. All models are publicly released (https://huggingface.co/crisistransformers), with the anticipation that they will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.

