Preventing the spread of misinformation is challenging. Detecting misleading content is a significant hurdle because of its extreme linguistic and domain variability. Content-based models have managed to identify deceptive language by learning representations from textual data such as social media posts and web articles. However, aggregating representative samples of this heterogeneous phenomenon and implementing effective real-world applications remain elusive. Building on analytical work on the language of misinformation, this paper examines the linguistic attributes that characterize the phenomenon and how well some of the most popular misinformation datasets represent those attributes. We demonstrate that combining pertinent symbolic knowledge with neural language models helps detect misleading content. Our approach achieves state-of-the-art performance across misinformation datasets, showing that it offers a valid and robust alternative to multi-task transfer learning without requiring any additional training data. Furthermore, our results provide evidence that structured knowledge can supply the extra boost needed to address a complex and unpredictable real-world problem like misinformation detection, not only in accuracy but also in time efficiency and resource utilization.