Text anomaly detection (TAD) plays a critical role in language-driven real-world applications such as harmful content moderation, phishing detection, and spam review filtering. While two-step "embedding-detector" TAD methods have shown state-of-the-art performance, their effectiveness is often limited by reliance on a single embedding model and a lack of adaptability across diverse datasets and anomaly types. To address these limitations, we propose $MCA^2$, a multi-view TAD framework that integrates embeddings from multiple pretrained language models. $MCA^2$ adopts a multi-view reconstruction model to extract normal textual patterns from multiple embedding perspectives. To exploit inter-view complementarity, a contrastive collaboration module strengthens the interactions across different views. Moreover, an adaptive allocation module automatically assigns a contribution weight to each view, thereby improving adaptability to diverse datasets. Extensive experiments on 10 benchmark datasets verify the effectiveness of $MCA^2$ against strong baselines. The source code of $MCA^2$ is available at https://github.com/yankehan/MCA2.
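To make the multi-view reconstruction and adaptive weighting ideas concrete, the following is a minimal sketch and not the $MCA^2$ implementation. It assumes each "view" is a matrix of embeddings of the same texts from a different model, substitutes a rank-$k$ PCA reconstructor for the learned reconstruction model, and uses a hypothetical softmax-over-residual heuristic in place of the adaptive allocation module. The contrastive collaboration module is omitted entirely. All function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_pca(X, k):
    """Fit a rank-k PCA reconstructor on normal-only data X of shape (n, d)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def recon_error(X, model):
    """Per-sample squared reconstruction error under a fitted PCA model."""
    mean, comps = model
    X_hat = (X - mean) @ comps.T @ comps + mean
    return ((X - X_hat) ** 2).sum(axis=1)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def fit_multiview(views, k=2):
    """Fit one reconstructor per view and derive per-view weights.

    Weighting rule (an assumption, not from the paper): views whose
    training residual is a smaller fraction of their total variance
    receive a larger weight.
    """
    models = [fit_pca(X, k) for X in views]
    train_err = [recon_error(X, m) for X, m in zip(views, models)]
    scales = np.array([e.mean() + 1e-12 for e in train_err])
    rel = np.array([e.mean() / X.var(axis=0).sum()
                    for e, X in zip(train_err, views)])
    weights = softmax(-rel)
    return models, scales, weights

def score(views, fitted):
    """Anomaly score: weighted sum of scale-normalized per-view errors."""
    models, scales, weights = fitted
    errs = np.stack([recon_error(X, m) / s
                     for X, m, s in zip(views, models, scales)], axis=1)
    return errs @ weights

# Synthetic stand-in for embeddings: normal samples lie near a 2-D
# subspace in each view, anomalies are isotropic Gaussian noise.
rng = np.random.default_rng(0)
dims = [8, 12]                      # two hypothetical embedding views
bases = [rng.normal(size=(2, d)) for d in dims]

def make_normal(n):
    z = rng.normal(size=(n, 2))
    return [z @ B + 0.01 * rng.normal(size=(n, B.shape[1])) for B in bases]

def make_anomalous(n):
    return [rng.normal(size=(n, d)) for d in dims]

fitted = fit_multiview(make_normal(200), k=2)
weights = fitted[2]
scores_norm = score(make_normal(50), fitted)
scores_anom = score(make_anomalous(50), fitted)
```

On this synthetic data the anomalous samples fall far from the subspace learned in each view, so their scores should be much larger than those of held-out normal samples. Real embeddings would come from pretrained language models, and the weighting rule here is only one of many possible choices.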