CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. All models are publicly released (https://huggingface.co/crisistransformers), with the anticipation that they will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.
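To make the semantic search use case mentioned in the abstract concrete, the following is a minimal sketch of ranking texts by the cosine similarity of their sentence embeddings. It does not load any CrisisTransformers model: the three-dimensional vectors are made-up toy values standing in for real encoder outputs, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs for the corpus, most similar to the query first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, NOT produced by any real sentence encoder).
corpus_embeddings = [
    [0.9, 0.1, 0.0],  # stands in for a flood-related tweet
    [0.1, 0.9, 0.0],  # stands in for a disease-outbreak tweet
    [0.0, 0.2, 0.9],  # stands in for an unrelated tweet
]
query_embedding = [0.8, 0.2, 0.1]  # stands in for a flood-related query

ranking = rank_by_similarity(query_embedding, corpus_embeddings)
for index, score in ranking:
    print(index, round(score, 3))
```

In practice the vectors would come from a sentence encoder, and the same ranking step would apply unchanged to embeddings of any dimension.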

