Wikipedia can be edited by anyone and therefore contains sentences of varying quality. Some edits are of poor quality, and other editors often mark them up. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting this process is important, but no large and comprehensive dataset for studying it currently exists. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE contains about 3.4 million sentences with 153 quality labels. In automatic classification experiments with competitive machine learning models, sentences with citation, syntax/semantics, or proposition problems were found to be more difficult to detect. In addition, we conducted automated essay scoring experiments to evaluate the generalizability of the dataset. Models trained on WikiSQE perform better than the vanilla model, indicating the dataset's potential usefulness in other domains. WikiSQE is expected to be a valuable resource for other tasks in NLP.