The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can support this task by extracting information about the crime, the criminal, or the law enforcement agencies involved. However, adequately annotated data on general real-world crime scenarios is scarce. To address this issue, we present CrimeNER, a case study of crime-related Zero-Shot and Few-Shot NER, and a general crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k documents annotated for the NER task, extracted from public reports on terrorist attacks and press releases from the U.S. Department of Justice. We define 5 coarse-grained crime entity types and a total of 22 fine-grained entity types. We assess the quality of the case study and the annotated data through experiments in Zero-Shot and Few-Shot settings with state-of-the-art NER models as well as general-purpose, commonly used Large Language Models.