Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. Detecting Wikipedia hoaxes is hard because they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of the similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature as well as official Wikipedia lists) alongside semantically similar real articles. We report results of binary classification experiments on the task of predicting whether a Wikipedia article is real or a hoax, and analyze several settings as well as a range of language models. Our results suggest that detecting deceitful content in Wikipedia based on content alone, despite having been little explored in the past, is a promising direction.