Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. Wikipedia hoaxes are hard to detect because they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of the similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature as well as official Wikipedia lists) alongside semantically similar real articles. We report results of binary classification experiments on the task of predicting whether a Wikipedia article is real or a hoax, and analyze several settings as well as a range of language models. Our results suggest that detecting deceitful content in Wikipedia based on content alone, although little explored in the past, is a promising direction.