Event data, or structured records of ``who did what to whom'' that are automatically extracted from text, is an important source of data for scholars of international politics. The high cost of developing new event datasets, especially with automated systems that rely on hand-built dictionaries, means that most researchers draw on large, pre-existing datasets such as ICEWS rather than developing tailor-made event datasets optimized for their specific research question. This paper describes a ``bag of tricks'' for efficient, custom event data production that draws on recent advances in natural language processing (NLP). The paper introduces techniques for three tasks: training an event category classifier with active learning; identifying actors and the recipients of actions in text using large language models, standard machine learning classifiers, and pretrained ``question-answering'' models from NLP; and resolving mentions of actors to their Wikipedia articles in order to categorize them. We describe how these techniques produced the new POLECAT global event dataset, which is intended to replace ICEWS, and give examples of how scholars can quickly produce smaller, custom event datasets. We publish example code and models to implement our new techniques.
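To make the active learning step more concrete, the following is a minimal sketch of one common selection strategy, uncertainty sampling, in which the documents the current classifier is least sure about are sent to annotators first. The abstract does not say which selection strategy the authors use, so the entropy criterion, the function names (`entropy`, `select_for_labeling`), the document ids, and the probabilities are all illustrative assumptions rather than the paper's implementation.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(
    predictions: Dict[str, Sequence[float]], batch_size: int
) -> List[str]:
    """Return the ids of the `batch_size` most uncertain documents.

    `predictions` maps a document id to the current classifier's
    predicted probabilities over event categories. Higher entropy
    means the classifier is less certain about the document.
    """
    ranked = sorted(
        predictions,
        key=lambda doc_id: entropy(predictions[doc_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical classifier output for four unlabeled sentences over
# three made-up event categories; these numbers are not from the paper.
unlabeled_predictions = {
    "doc_a": [0.98, 0.01, 0.01],  # confident prediction
    "doc_b": [0.40, 0.35, 0.25],  # spread across all categories
    "doc_c": [0.70, 0.20, 0.10],
    "doc_d": [0.50, 0.49, 0.01],
}

# Pick the two documents an annotator should label next.
to_annotate = select_for_labeling(unlabeled_predictions, batch_size=2)
print(to_annotate)
```

In a full loop, the newly labeled documents would be added to the training set, the classifier retrained, and the selection repeated until performance is adequate.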