Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis and cannot identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets with token-level labels, spanning spam, reviews, and grammar errors. Experimental results demonstrate that our framework outperforms six baselines, opening new possibilities for precise anomaly localization in text. All code and data are publicly available at https://github.com/charles-cao/TokenCore.