The Impact of Data Persistence Bias on Social Media Studies

Social media studies often collect data retrospectively to analyze public opinion. Social media data may decay over time and such decay may prevent the collection of the complete dataset. As a result, the collected dataset may differ from the complete dataset and the study may suffer from data persistence bias. Past research suggests that the datasets collected retrospectively are largely representative of the original dataset in terms of textual content. However, no study analyzed the impact of data persistence bias on social media studies such as those focusing on controversial topics. In this study, we analyze the data persistence and the bias it introduces on the datasets of three types: controversial topics, trending topics, and framing of issues. We report which topics are more likely to suffer from data persistence among these datasets. We quantify the data persistence bias using the change in political orientation, the presence of potentially harmful content and topics as measures. We found that controversial datasets are more likely to suffer from data persistence and they lean towards the political left upon recollection. The turnout of the data that contain potentially harmful content is significantly lower on non-controversial datasets. Overall, we found that the topics promoted by right-aligned users are more likely to suffer from data persistence. Account suspensions are the primary factor contributing to data removals, if not the only one. Our results emphasize the importance of accounting for the data persistence bias by collecting the data in real time when the dataset employed is vulnerable to data persistence bias.

翻译：社交媒体研究常采用回顾性数据收集方法分析公众舆论。社交媒体数据可能随时间推移而衰减，这种衰减可能导致无法获取完整数据集。因此，所收集的数据集可能与完整数据集存在差异，研究可能面临数据持久性偏差问题。过往研究表明，基于文本内容而言，回顾性收集的数据集大体上能代表原始数据集。然而，尚无研究针对争议性话题等特定社交媒体研究议题，分析数据持久性偏差的具体影响。本研究从三个维度分析数据持久性及其引发的偏差：争议性话题、热门话题及议题框架。我们揭示了上述数据集中哪些话题更易受数据持久性影响，并通过政治倾向变化、潜在有害内容及话题的存在性等指标量化数据持久性偏差。研究发现，争议性数据集更易出现数据持久性问题，且在二次收集时呈现政治左倾倾向。非争议性数据集中含潜在有害内容的数据留存率显著偏低。总体而言，右翼用户主导的话题更易遭受数据持久性影响。账户封禁是数据删除的主要因素——若非唯一因素。本研究成果强调，当采用的数据集易受数据持久性偏差影响时，实时数据采集对控制该偏差具有关键意义。