Wikipedia can be edited by anyone and thus contains sentences of varying quality. As a result, it includes some poor-quality edits, which are often marked up by other editors. Although editors' reviews enhance Wikipedia's credibility, it is hard to check all edited text. Assisting in this process is very important, but no large and comprehensive dataset for studying it currently exists. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of English Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE contains about 3.4 M sentences with 153 quality labels. In automatic classification experiments with competitive machine learning models, sentences with problems in citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, through human annotation, we found that the model we developed performed better than crowdsourced workers. WikiSQE is expected to be a valuable resource for other tasks in NLP.