Few-shot learning for automated content analysis: Efficient coding of arguments and claims in the debate on arms deliveries to Ukraine

Pre-trained language models (PLMs) based on transformer neural networks, developed in the field of natural language processing (NLP), offer great opportunities to improve automatic content analysis in communication science, especially for coding complex semantic categories in large datasets via supervised machine learning. However, three obstacles have so far impeded the widespread adoption of these methods in the applied disciplines: the dominance of English-language models in NLP research, the computing resources required, and the effort needed to produce training data for fine-tuning PLMs. In this study, we address these challenges by combining a multilingual transformer model with the adapter extension to transformers and with few-shot learning methods. We test our approach on a realistic use case from communication science: automatically detecting claims and arguments, together with their stance, in the German news debate on arms deliveries to Ukraine. In three experiments, we evaluate (1) data preprocessing strategies and model variants for this task, (2) the performance of different few-shot learning methods, and (3) how well the best setup performs on varying training set sizes in terms of validity, reliability, replicability, and reproducibility of the results. We find that our proposed combination of transformer adapters with pattern-exploiting training provides a parameter-efficient and easily shareable alternative to fully fine-tuning PLMs. It performs on par with full fine-tuning in terms of validity while offering better overall properties for application in communication studies. The results also show that pre-fine-tuning for a task on a near-domain dataset leads to substantial improvement, particularly in the few-shot setting. Further, the results indicate that it is useful to bias the dataset away from the viewpoints of specific prominent individuals.
