CrisiText: A dataset of warning messages for LLM training in emergency communication

Identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. AI has been used to assist humans in such emergencies, but the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation with NLG architectures has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To build the dataset, we started from existing crisis descriptions and created chains of events related to each scenario. Each event was then paired with a warning message. The generated messages follow guidelines written by experts to ensure correct terminology and the factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.
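As an illustration of how one message accompanied by three suboptimal variants could feed a preference-alignment setup, the following is a minimal sketch. It does not reflect the dataset's actual schema: the `WarningRecord` class, its field names, the `variant_a`/`variant_b`/`variant_c` keys, and the example texts are all hypothetical, and the `prompt`/`chosen`/`rejected` keys are only a common convention for preference data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WarningRecord:
    """One event from a crisis chain with its messages (hypothetical schema)."""
    crisis_type: str            # assumed label for one of the 13 scenario types
    event: str                  # event description taken from the chain
    message: str                # guideline-following warning message
    suboptimal: Dict[str, str]  # three suboptimal variants, keyed by type


def to_preference_pairs(record: WarningRecord) -> List[dict]:
    """Pair the preferred message with each suboptimal variant."""
    return [
        {
            "prompt": record.event,
            "chosen": record.message,
            "rejected": text,
            "rejected_type": kind,
        }
        for kind, text in record.suboptimal.items()
    ]


# Invented example values, for illustration only.
example = WarningRecord(
    crisis_type="flood",
    event="River levels are rising quickly near the town centre.",
    message="Move to higher ground now and avoid walking through floodwater.",
    suboptimal={
        "variant_a": "placeholder suboptimal message A",
        "variant_b": "placeholder suboptimal message B",
        "variant_c": "placeholder suboptimal message C",
    },
)

pairs = to_preference_pairs(example)
for pair in pairs:
    print(pair["rejected_type"], "->", pair["rejected"])
```

Under these assumptions, each event yields three preference pairs, while the `message` field alone would serve as the target for supervised fine-tuning.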

