Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy

The widespread adoption of Large Language Models (LLMs) has made the detection of AI-generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is rarely examined. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. In-depth error analysis further exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the features most susceptible to domain shift, formatting variation, and text-length effects. We believe these findings can inform the design of AI detectors that remain robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.
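The text-length effect named in the abstract can be illustrated with a toy example. The sketch below is not the paper's feature set or its released package: the function `linguistic_features` and its three features are hypothetical, chosen only because they are common stylometric measures. It computes the features for a short text and for the same text repeated four times, so the writing style is identical and only the length changes.

```python
import re


def linguistic_features(text: str) -> dict[str, float]:
    """Compute three simple stylometric features.

    Hypothetical illustration only; these are NOT the 30 features
    used in the paper.
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept for contractions).
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens or not sentences:
        return {"mean_sentence_len": 0.0, "type_token_ratio": 0.0, "comma_rate": 0.0}
    return {
        # Average number of tokens per sentence.
        "mean_sentence_len": len(tokens) / len(sentences),
        # Distinct tokens divided by total tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Commas per token.
        "comma_rate": text.count(",") / len(tokens),
    }


short = "The detector reads the text. It scores each sentence, then it decides."
long = " ".join([short] * 4)  # same style, four times the length

f_short = linguistic_features(short)
f_long = linguistic_features(long)

for name in f_short:
    print(f"{name}: short={f_short[name]:.3f} long={f_long[name]:.3f}")
```

In this toy case the mean sentence length and comma rate are unchanged, while the type-token ratio drops sharply for the longer text even though the style is the same. A classifier that leans on such a feature could therefore shift its prediction when only the input length changes, which is one way the kind of fragility described above can arise.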
