Classifying multilingual party manifestos: Domain transfer across country, time, and genre

The cost of annotating large corpora remains one of the main bottlenecks in empirical social science research. On the one hand, domain transfer makes it possible to reuse annotated data sets and trained models. On the other hand, it is unclear how well domain transfer works and how reliable the results are when transferring across different dimensions. We explore the potential of domain transfer across geographical locations, languages, time, and genre in a large-scale database of political manifestos. First, we show the strong within-domain classification performance of fine-tuned transformer models. Second, we vary the test set across the aforementioned dimensions to test the fine-tuned models' robustness and transferability. For switching genres, we use an external corpus of transcribed speeches from New Zealand politicians, while for the other three dimensions we use custom splits of the Manifesto database. Although BERT achieves the best scores in the initial experiments across modalities, DistilBERT proves competitive at a lower computational cost and is therefore used for the further experiments across time and country. The results of this additional analysis show that (Distil)BERT can be applied to future data with similar performance. Moreover, we observe differences, some of them notable, between the political manifestos of different countries of origin, even when these countries share a language or a cultural background.
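To make the idea of "custom splits" across time and country concrete, the following is a minimal sketch of how such splits could be built. It is not the authors' code: the record fields (`country`, `year`, `text`, `label`), the function names, and the toy records are all assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record schema (an assumption, not the Manifesto database format):
# {"country": str, "year": int, "text": str, "label": str}
Record = Dict[str, object]


def split_by_time(records: List[Record], cutoff_year: int) -> Tuple[List[Record], List[Record]]:
    """Train on documents up to and including cutoff_year, test on later ones."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test


def split_by_country(records: List[Record], held_out: str) -> Tuple[List[Record], List[Record]]:
    """Train on all countries except held_out, test on held_out only."""
    train = [r for r in records if r["country"] != held_out]
    test = [r for r in records if r["country"] == held_out]
    return train, test


# Toy records, made up purely to exercise the functions.
records: List[Record] = [
    {"country": "A", "year": 2010, "text": "sentence one", "label": "x"},
    {"country": "A", "year": 2018, "text": "sentence two", "label": "y"},
    {"country": "B", "year": 2012, "text": "sentence three", "label": "x"},
    {"country": "B", "year": 2019, "text": "sentence four", "label": "y"},
]

time_train, time_test = split_by_time(records, cutoff_year=2015)
country_train, country_test = split_by_country(records, held_out="B")
```

In a setup like this, a classifier would be fine-tuned on the training part only and evaluated on the held-out part, so that any drop in performance can be attributed to the shift in time or country rather than to random variation within one domain.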

