The growing number of political debates and politics-related discussions calls for novel computational methods that automatically analyse such content, with the ultimate goal of making political deliberation clearer to citizens. However, the specificity of political language and the argumentative form of these debates (which employ hidden communication strategies and leverage implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models (LMs). To address this, we introduce RooseBERT, a novel pre-trained LM for political discourse. Pre-training an LM on a specialised domain presents various technical and linguistic challenges, and requires extensive computational resources and large-scale data. RooseBERT has been trained on large English corpora of political debates and speeches (11GB). To evaluate its performance, we fine-tuned it on multiple downstream tasks related to political debate analysis, namely stance detection, sentiment analysis, argument component detection and classification, argument relation prediction and classification, policy classification, and named entity recognition (NER). Our results show improvements over general-purpose LMs on the majority of these tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release RooseBERT for the research community.