Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter

Popular social media networks provide an ideal environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has largely been done for English. Although some efforts have recently been made to develop annotated data in other languages, there is a telling lack of resources to facilitate multilingual and cross-lingual research on stance detection. This is partially because manually annotating a corpus of social media texts is a difficult, slow and costly process. Furthermore, as stance is a highly domain- and topic-specific phenomenon, the need for annotated data is especially pressing. As a result, most manually labeled resources are hindered by their relatively small size and skewed class distribution. This paper presents a method to obtain multilingual datasets for stance detection in Twitter. Instead of manually annotating on a per-tweet basis, we leverage user-based information to semi-automatically label large amounts of tweets. Empirical monolingual and cross-lingual experimentation and qualitative analysis show that our method helps to overcome the aforementioned difficulties and to build large, balanced and multilingual labeled corpora. We believe that our method can be easily adapted to generate labeled social media data for other Natural Language Processing tasks and domains.
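
The abstract does not spell out how user-based information is turned into tweet-level labels. As a rough illustration only, the sketch below assumes that a stance has already been assigned to some users by some external means, and that each of their tweets simply inherits that stance. The function names, the `user`, `lang` and `text` fields, and the `FAVOR`/`AGAINST` label set are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def propagate_user_stance(tweets, user_stance):
    """Give each tweet its author's stance (an assumed labeling rule).

    tweets: list of dicts with at least a "user" key (hypothetical schema).
    user_stance: dict mapping a user id to a stance label.
    Tweets whose author has no known stance are left out.
    """
    labeled = []
    for tweet in tweets:
        stance = user_stance.get(tweet["user"])
        if stance is not None:
            labeled.append({**tweet, "stance": stance})
    return labeled


def class_distribution(labeled):
    """Count tweets per stance label, to check how balanced the result is."""
    return dict(Counter(t["stance"] for t in labeled))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    tweets = [
        {"user": "u1", "lang": "es", "text": "tweet one"},
        {"user": "u2", "lang": "en", "text": "tweet two"},
        {"user": "u3", "lang": "en", "text": "tweet three"},
    ]
    user_stance = {"u1": "FAVOR", "u2": "AGAINST"}
    labeled = propagate_user_stance(tweets, user_stance)
    print(class_distribution(labeled))
```

Under this assumption the annotation effort scales with the number of users rather than the number of tweets, which is one way a per-user signal could yield a much larger labeled set than per-tweet annotation.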
