The cost of annotating large corpora remains one of the main bottlenecks in empirical social science research. On the one hand, domain transfer allows annotated data sets and trained models to be reused. On the other hand, it is not clear how well domain transfer works and how reliable the results are for transfer across different dimensions. We explore the potential of domain transfer across geographical locations, languages, time, and genre in a large-scale database of political manifestos. First, we show the strong within-domain classification performance of fine-tuned transformer models. Second, we vary the test set along the aforementioned dimensions to test the fine-tuned models' robustness and transferability. For switching genres, we use an external corpus of transcribed speeches from New Zealand politicians, while for the other three dimensions, custom splits of the Manifesto database are used. While BERT achieves the best scores in the initial experiments across modalities, DistilBERT proves to be competitive at a lower computational expense and is thus used for further experiments across time and country. The results of the additional analysis show that (Distil)BERT can be applied to future data with similar performance. Moreover, we observe (partly) notable differences between the political manifestos of different countries of origin, even if these countries share a language or a cultural background.