Conflict scholars have long used rule-based approaches to extract information about political violence from news reports and other texts. Recent developments in natural language processing move beyond these rigid rule-based approaches. We review our recent ConfliBERT language model (Hu et al. 2022), which is designed to process texts about politics and violence. The model can be used to extract actor and action classifications from texts about political conflict. When fine-tuned, ConfliBERT achieves higher accuracy, precision, and recall within its relevant domains than other large language models (LLMs) such as Google's Gemma 2 (9B), Meta's Llama 3.1 (7B), and Alibaba's Qwen 2.5 (14B). It is also hundreds of times faster than these more generalist LLMs. We illustrate these results using texts from the BBC, re3d, and the Global Terrorism Database (GTD).