The popularity of social media has led politicians to use it for political advertising. As a result, social media is full of electoral agitation (electioneering), especially during election campaigns. The election administration cannot track the spread and quantity of messages that count as agitation under the election code. This work addresses that crucial problem and, in doing so, uncovers a niche that has not been effectively targeted so far. We present the first publicly available data set for detecting electoral agitation in the Polish language. It contains 6,112 human-annotated tweets tagged with four legally conditioned categories. We achieved an inter-annotator agreement of 0.66 (Cohen's kappa). An additional annotator resolved the mismatches between the first two, improving the consistency and complexity of the annotation process. The newly created data set was used to fine-tune a Polish language model called HerBERT, achieving a 68% F1 score. We also present a number of potential use cases for such data sets and models, enriching the paper with an analysis of the 2020 Polish presidential election on Twitter.
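As an illustration of the agreement figure reported above, the following is a minimal sketch of how Cohen's kappa for two annotators, and a third-annotator resolution of their mismatches, could be computed. It is not the authors' code: the function names `cohen_kappa` and `adjudicate` and the label strings are hypothetical placeholders, since the abstract does not name the four categories.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def adjudicate(labels_a, labels_b, labels_c):
    """Keep agreed labels; where the first two differ, take the third annotator's label."""
    return [a if a == b else c for a, b, c in zip(labels_a, labels_b, labels_c)]


# Hypothetical placeholder labels, not the data set's actual categories.
annotator_1 = ["agitation", "agitation", "neutral", "neutral"]
annotator_2 = ["agitation", "neutral", "neutral", "neutral"]
annotator_3 = ["agitation", "agitation", "neutral", "neutral"]

kappa = cohen_kappa(annotator_1, annotator_2)
final_labels = adjudicate(annotator_1, annotator_2, annotator_3)
```

In this toy example the two annotators agree on three of four items, chance agreement is 0.5, and kappa is therefore 0.5. The reported 0.66 would come from the same calculation over the full set of doubly annotated tweets.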