High-resolution event data on armed conflict and related processes, from datasets such as UCDP GED and ACLED, have revolutionized the study of political contention. However, most of these datasets limit themselves to high-resolution spatio-temporal and intensity data. Information on conflict dynamics, such as targets, tactics, and purposes, is rarely collected owing to the extreme workload of data collection. At the same time, most datasets rely on a rich corpus of textual data, which allows further information connected to each event to be mined. This paper proposes one such approach that is inexpensive and high-performing, leveraging active learning, an iterative process of improving a machine learning model based on sequential (guided) human input. Active learning is used to fine-tune, step by step, a large encoder-only language model adapted for extracting sub-classes of events relating to conflict dynamics. The approach shows performance similar to human (gold-standard) coding while reducing the amount of required human annotation by as much as 99%.
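To make the iterative process concrete, the following is a minimal sketch of one common form of active learning, pool-based uncertainty sampling. It rests on several assumptions that are not taken from the paper: the query strategy (the abstract does not say how examples are selected for annotation), a tiny hand-written logistic regression standing in for the fine-tuned encoder-only language model, a synthetic two-dimensional pool standing in for event texts, and a hypothetical `oracle` function standing in for the human annotator. All names and parameter values are illustrative.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a small logistic regression by SGD.

    Assumption: this stands in for one fine-tuning pass of a language model.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def active_learning(pool, oracle, seed_idx, rounds=5, batch=4):
    """Uncertainty sampling: repeatedly label the items the model is least sure of."""
    labeled = {i: oracle(pool[i]) for i in seed_idx}  # initial human labels
    for _ in range(rounds):
        idx = sorted(labeled)
        w, b = train_logreg([pool[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Least confident first: predicted probability closest to 0.5.
        unlabeled.sort(key=lambda i: abs(predict_proba(w, b, pool[i]) - 0.5))
        for i in unlabeled[:batch]:
            labeled[i] = oracle(pool[i])  # guided human input
    idx = sorted(labeled)
    model = train_logreg([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


# Synthetic pool (assumption): 200 points with a linear ground-truth rule.
random.seed(0)
pool = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]


def oracle(x):
    """Hypothetical annotator: returns the true label on request."""
    return 1 if x[0] + x[1] > 0 else 0


seed_idx = random.sample(range(len(pool)), 4)
(w, b), labeled = active_learning(pool, oracle, seed_idx)

# Agreement between the final model and the oracle over the whole pool.
accuracy = sum(
    (predict_proba(w, b, x) > 0.5) == bool(oracle(x)) for x in pool
) / len(pool)
print(f"labeled {len(labeled)} of {len(pool)} items, accuracy {accuracy:.2f}")
```

In this toy setting only 24 of the 200 pool items are ever sent to the annotator. The savings and accuracy reported in the abstract come from the authors' own model and data, not from this sketch.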