This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. We apply the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus, which contains over 8 million speeches from 28 parliaments of European countries and autonomous regions. To do so, we follow a teacher-student framework: a high-performing large language model (LLM) annotates in-domain training data, and a multilingual encoder model is fine-tuned on these annotations so that the full corpus can be annotated at scale. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases: the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
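The abstract compares LLM-human agreement with human-human agreement but does not name the agreement statistic. As a minimal sketch, assuming Cohen's kappa (one common chance-corrected measure, not necessarily the one used in the paper), the comparison could look like the following. The function name `cohen_kappa` and all labels are hypothetical toy data invented for illustration; they are not drawn from ParlaCAP.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: kappa is used here only as an illustrative agreement
    measure; the abstract does not state which statistic was used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (invented): CAP-style major topics for eight speeches.
human_1 = ["Health", "Health", "Education", "Defense",
           "Macroeconomics", "Environment", "Health", "Education"]
human_2 = ["Health", "Education", "Education", "Defense",
           "Macroeconomics", "Environment", "Health", "Education"]
llm = ["Health", "Health", "Education", "Defense",
       "Macroeconomics", "Health", "Health", "Education"]

human_human = cohen_kappa(human_1, human_2)
llm_human = cohen_kappa(llm, human_1)
print(f"human-human kappa: {human_human:.3f}")
print(f"LLM-human kappa:   {llm_human:.3f}")
```

If the LLM-human value falls within the range of the human-human values, the LLM annotations can plausibly serve as training labels for the student encoder, which is the argument the abstract makes.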