Following recent policy changes by X (Twitter) and other social media platforms, user interaction data has become increasingly difficult to access. These restrictions impede robust research on online social and political phenomena, which is critical given the profound impact social media platforms may have on our societies. Here, we investigate the reliability of polarization measures obtained from different samples of social media data by studying the structural polarization of the Polish political debate on Twitter over a 24-hour period. First, we show that the political discussion on Twitter is only a small subset of the wider Twitter discussion. Second, we find that large samples can be representative of the whole political discussion on a platform, but small samples consistently fail to reflect the true structure of polarization online. Finally, we demonstrate that keyword-based samples can be representative if keywords are selected with great care, but that poorly selected keywords can introduce substantial political bias into the sampled data. Our findings demonstrate that polarization cannot be measured reliably from small sampled datasets, highlighting why the current lack of research data is so problematic and providing insight into the practical implementation of the European Union's Digital Services Act, which aims to improve researchers' access to social media data.