NLP-ADBench: NLP Anomaly Detection Benchmark

Anomaly detection (AD) is a critical machine learning task with diverse applications in web systems, including fraud detection, content moderation, and user behavior analysis. Despite its significance, AD in natural language processing (NLP) remains underexplored, limiting advancements in detecting anomalies in text data such as harmful content, phishing attempts, or spam reviews. In this paper, we introduce NLP-ADBench, the most comprehensive benchmark for NLP anomaly detection (NLP-AD), comprising eight curated datasets and evaluations of nineteen state-of-the-art algorithms. These include three end-to-end methods and sixteen two-step algorithms that apply traditional anomaly detection techniques to language embeddings generated by bert-base-uncased and OpenAI's text-embedding-3-large models. Our results reveal critical insights and future directions for NLP-AD. Notably, no single model excels across all datasets, highlighting the need for automated model selection. Moreover, two-step methods leveraging transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings demonstrating superior performance over BERT embeddings. By releasing NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, we provide a standardized framework for evaluating NLP-AD methods, fostering the development of innovative approaches. This work fills a crucial gap in the field and establishes a foundation for advancing NLP anomaly detection, particularly in the context of improving the safety and reliability of web-based systems.
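
The two-step approach described above (embed each text, then score the embeddings with a classical anomaly detector) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation. `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real encoder such as bert-base-uncased or text-embedding-3-large. `knn_scores` is a simple k-nearest-neighbor distance detector chosen for brevity, and it is not necessarily one of the benchmarked algorithms. The example texts are invented.

```python
import hashlib
import math
from typing import List

DIM = 256  # toy embedding size (an assumption for this sketch)


def toy_embed(text: str, dim: int = DIM) -> List[float]:
    """Step 1 (placeholder): hashed bag-of-words vector, L2-normalized.

    Hypothetical stand-in for a real language-model embedding; it is
    NOT the encoder used in the benchmark.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard against empty text
    return [v / norm for v in vec]


def knn_scores(train: List[List[float]], queries: List[List[float]], k: int = 2) -> List[float]:
    """Step 2: anomaly score = mean Euclidean distance to the k nearest
    training vectors. A higher score means more anomalous."""
    scores = []
    for q in queries:
        dists = sorted(math.dist(q, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores


# Invented example data: training texts are assumed to be normal.
normal_texts = [
    "the delivery arrived on time and the product works well",
    "the product works well and the delivery was fast",
    "fast delivery and the product arrived on time",
    "the product arrived fast and works well",
]
test_texts = [
    "the product works well and arrived on time",    # resembles the normal data
    "click this link now to claim your free prize",  # spam-like outlier
]

train_vecs = [toy_embed(t) for t in normal_texts]
scores = knn_scores(train_vecs, [toy_embed(t) for t in test_texts])
for text, score in zip(test_texts, scores):
    print(f"{score:.3f}  {text}")
```

In this sketch the spam-like text should receive the higher score because it shares no vocabulary with the normal training texts. Swapping `toy_embed` for a real embedding model and `knn_scores` for another detector leaves the two-step structure unchanged.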
