A crucial property of a rumor detection model is its ability to generalize, in particular to detect emerging, previously unseen rumors. Past research has indicated that content-based rumor detection models (i.e., those using solely source posts as input) tend to perform less effectively on unseen rumors. At the same time, the potential of context-based models remains largely untapped. The main contribution of this paper is an in-depth evaluation of the performance gap between content-based and context-based models, specifically on detecting new, unseen rumors. Our empirical findings demonstrate that context-based models are still overly dependent on information derived from the rumors' source posts and tend to overlook the significant role that contextual information can play. We also study the effect of data split strategies on classifier performance. Based on our experimental results, we offer practical suggestions on how to minimize the effects of temporal concept drift in static datasets when training rumor detection methods.
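The abstract does not spell out which data split strategies are compared, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes each post is a dictionary with a hypothetical `created_at` timestamp field, and contrasts a random split (where test posts may predate training posts) with a chronological split (where every test post is at least as recent as every training post, which more closely mimics detecting newly emerging rumors).

```python
import random
from datetime import datetime, timedelta


def random_split(posts, test_fraction=0.2, seed=0):
    """Shuffle posts and split them, ignoring time (hypothetical baseline)."""
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def chronological_split(posts, test_fraction=0.2):
    """Train on the oldest posts and test on the newest ones.

    Assumes every post has a 'created_at' datetime; this field name is an
    assumption made for illustration only.
    """
    ordered = sorted(posts, key=lambda p: p["created_at"])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    start = datetime(2020, 1, 1)
    # Toy data: ten posts, one per day, with made-up text.
    posts = [
        {"text": f"post {i}", "created_at": start + timedelta(days=i)}
        for i in range(10)
    ]
    train, test = chronological_split(posts)
    print(len(train), len(test))
    print(max(p["created_at"] for p in train) <= min(p["created_at"] for p in test))
```

Under a chronological split, no information from the test period can leak into training, so a drop in accuracy relative to the random split gives a rough indication of how much a model is affected by temporal drift.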