Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data. Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, this roughly four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering NLP-assisted research: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.
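The abstract's central claim, that aggregate accuracy can mask large differences on rare classes, can be illustrated with a minimal stdlib sketch. The toy labels below are hypothetical (not drawn from the GTD), and the `per_class_f1` helper is an assumption introduced purely for illustration: two classifiers that differ by only five accuracy points diverge sharply on a class that makes up 10% of the sample.

```python
def per_class_f1(y_true, y_pred):
    """Compute one-vs-rest F1 for each class from parallel label lists."""
    scores = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

# Hypothetical toy data: 18 common "bombing" events, 2 rare "hijacking" events.
y_true = ["bombing"] * 18 + ["hijacking"] * 2
pred_a = ["bombing"] * 18 + ["hijacking", "hijacking"]  # catches both rare events
pred_b = ["bombing"] * 18 + ["bombing", "hijacking"]    # misses one rare event

acc_a = sum(t == p for t, p in zip(y_true, pred_a)) / len(y_true)  # 1.00
acc_b = sum(t == p for t, p in zip(y_true, pred_b)) / len(y_true)  # 0.95
f1_a = per_class_f1(y_true, pred_a)  # hijacking F1 = 1.00
f1_b = per_class_f1(y_true, pred_b)  # hijacking F1 ≈ 0.67, bombing F1 ≈ 0.97
```

A single missed rare event costs classifier B five points of accuracy but a third of its hijacking F1, while its bombing F1 barely moves, which is exactly the pattern the Confli-mBERT/ConfliBERT comparison reports at scale.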