The rapid development of large language models (LLMs) has led to a surge in AI-generated text; in particular, students increasingly submit LLM-generated content as their own work, violating academic integrity. This paper evaluates AI text detection methods spanning both traditional machine learning models and transformer-based architectures. We combine two datasets, HC3 and DAIGT v2, into a unified benchmark and apply a topic-based data split to prevent information leakage, ensuring robust generalization to unseen domains. Our experiments show that a TF-IDF logistic regression baseline achieves a reasonable 82.87% accuracy, but deep learning models outperform it: a BiLSTM classifier reaches 88.86% accuracy, while DistilBERT achieves a comparable 88.11% accuracy with the highest ROC-AUC of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is substantially superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work relate primarily to dataset diversity and computational constraints. In future work, we plan to broaden dataset coverage, apply parameter-efficient fine-tuning methods such as LoRA, explore smaller or distilled models, and employ more efficient batching strategies and hardware-aware optimization.
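The topic-based split mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual pipeline: the field names (`text`, `topic`, `label`) and the toy records are hypothetical, and the key property is only that every topic lands entirely in one side of the split, so a model cannot score well on the test set by memorizing topic-specific vocabulary seen in training.

```python
# Hedged sketch of a topic-based train/test split (illustrative schema, not the paper's).
import random

def topic_split(records, test_frac=0.2, seed=0):
    """Split records so that every topic appears in exactly one side."""
    topics = sorted({r["topic"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(topics)
    n_test = max(1, int(len(topics) * test_frac))
    test_topics = set(topics[:n_test])
    train = [r for r in records if r["topic"] not in test_topics]
    test = [r for r in records if r["topic"] in test_topics]
    return train, test

# Toy data: label 1 = AI-generated, 0 = human-written (hypothetical examples).
records = [
    {"text": "essay on climate", "topic": "climate", "label": 1},
    {"text": "human note on climate", "topic": "climate", "label": 0},
    {"text": "essay on history", "topic": "history", "label": 1},
    {"text": "human note on art", "topic": "art", "label": 0},
    {"text": "essay on art", "topic": "art", "label": 1},
]
train, test = topic_split(records, test_frac=0.34, seed=0)
# By construction, no topic occurs in both splits.
assert {r["topic"] for r in train}.isdisjoint({r["topic"] for r in test})
```

In practice the same grouping can be done with scikit-learn's `GroupShuffleSplit`, passing the topic identifiers as the `groups` argument.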