This paper presents a novel system design aimed at supporting the Wikipedia community in addressing vandalism on the platform. To this end, we collected a large dataset covering 47 languages and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling, to build the training dataset from human-generated data. We evaluated the system by comparing it with ORES, the system used in production on Wikipedia. Our work significantly increases the number of languages covered, making Wikipedia patrolling more efficient for a wider range of communities. Furthermore, our model outperforms ORES, providing results that are not only more accurate but also less biased against certain groups of contributors.