Large language and vision models (LLVMs) have transformed how social movement scholars identify protests and extract key protest attributes from multimodal data such as text, images, and videos. This article documents how we fine-tuned two large pretrained transformer models, Longformer and Swin Transformer V2, to infer potential protests in news articles from textual and imagery data. First, the Longformer model was fine-tuned using the Dynamics of Collective Action (DoCA) corpus. We matched New York Times articles with the DoCA database to obtain a training dataset for downstream tasks. Second, the Swin Transformer V2 model was fine-tuned on the UCLA-protest imagery data. The UCLA-protest project contains labeled images with annotations such as protest, violence, and sign. Both fine-tuned models will be available via \url{https://github.com/Joshzyj/llvms4protest}. We release this short technical report for social movement scholars who are interested in using LLVMs to infer protests in textual and imagery data.