The emergence of large language models (LLMs) capable of generating realistic text and images has raised ethical concerns across many sectors. In response, researchers in academia and industry are actively exploring methods to distinguish AI-generated content from human-authored material. However, a crucial question remains open: what are the distinguishing characteristics of AI-generated text? To address this question, this study proposes StyloAI, a data-driven model that uses 31 stylometric features to identify AI-generated text, applying a Random Forest classifier to two multi-domain datasets. StyloAI achieves accuracies of 81% on the test set of the AuTextification dataset and 98% on the test set of the Education dataset. This approach surpasses the performance of existing state-of-the-art models and provides insight into the differences between AI-generated and human-authored text.
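To make the notion of a stylometric feature concrete, the following is a minimal sketch of how a few such features might be computed from raw text. The function name `stylometric_features` and the five features chosen here are illustrative assumptions; they are not the 31 features used by StyloAI, which the abstract does not enumerate.

```python
import re
import string
from typing import Dict


def stylometric_features(text: str) -> Dict[str, float]:
    """Compute a handful of illustrative stylometric features for one text.

    Hypothetical helper: the feature set is an assumption for illustration,
    not the feature set described in the paper.
    """
    # Crude sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-cased word tokens made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())

    n_words = len(words)
    n_sents = len(sentences)
    n_chars = len(text)
    n_punct = sum(ch in string.punctuation for ch in text)

    return {
        "word_count": float(n_words),
        # Average number of words per sentence.
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
        # Average number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Lexical diversity: unique words divided by total words.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        # Share of characters that are ASCII punctuation.
        "punctuation_ratio": n_punct / n_chars if n_chars else 0.0,
    }


sample = "The cat sat. The cat ran!"
features = stylometric_features(sample)
```

Feature dictionaries of this kind, computed per document, could then be stacked into a matrix and passed to a tree-ensemble classifier such as scikit-learn's `RandomForestClassifier`; that step is omitted here to keep the sketch dependency-free.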