Current content moderation practices follow a trial-and-error approach: moderators apply sequences of interventions until they obtain the desired outcome. Being able to estimate the effects of an intervention in advance would instead give moderators the opportunity to plan their actions before applying them. As a first step towards this goal, we propose and tackle the novel task of predicting the effect of a moderation intervention. We study the reactions of 16,540 users to a massive ban of online communities on Reddit, training a set of binary classifiers to identify those users who would abandon the platform after the intervention, a problem of great practical relevance. We leverage a dataset of 13.8M posts to compute a large and diverse set of 142 features, which convey information about the activity, toxicity, relations, and writing style of the users. We obtain promising results, with the best-performing model achieving micro F1 = 0.800 and macro F1 = 0.676. Our model generalizes robustly when applied to users from previously unseen communities. Furthermore, we identify activity features as the most informative predictors, followed by relational and toxicity features, while writing style features exhibit limited utility. Our results demonstrate the feasibility of predicting the effects of a moderation intervention, paving the way for a new research direction in predictive content moderation aimed at empowering moderators with intelligent tools to plan their actions ahead.
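The two reported scores aggregate per-class performance differently, which is why they can diverge. The following is a minimal sketch of the standard micro and macro F1 definitions for a binary task, not the authors' evaluation code. The labels (0 = stays, 1 = abandons), the function names, and the ten-user toy data are all assumptions made for illustration and are not taken from the paper's dataset.

```python
from typing import Sequence, Tuple


def class_counts(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) when `label` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1_from_counts(*class_counts(y_true, y_pred, lab)) for lab in labels]
    return sum(scores) / len(scores)


def micro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """F1 computed from TP, FP, FN pooled over all classes."""
    tp = fp = fn = 0
    for lab in labels:
        c_tp, c_fp, c_fn = class_counts(y_true, y_pred, lab)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return f1_from_counts(tp, fp, fn)


# Hypothetical toy data: 8 users who stay (0) and 2 who abandon (1).
# The classifier misses one of the two abandoning users.
toy_true = [0] * 8 + [1, 1]
toy_pred = [0] * 8 + [1, 0]

toy_micro = micro_f1(toy_true, toy_pred)
toy_macro = macro_f1(toy_true, toy_pred)
print(f"micro F1 = {toy_micro:.3f}, macro F1 = {toy_macro:.3f}")
```

In this toy case the single error costs the minority class far more than the majority class, so macro F1 falls below micro F1. For single-label classification, micro F1 coincides with accuracy. A gap between the two scores often points to class imbalance, although the abstract itself does not state the class distribution.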