Artificial intelligence (AI)-based large language models (LLMs) hold considerable promise for revolutionizing education, research, and practice. However, distinguishing between human-written and AI-generated text has become a significant challenge. This paper presents a comparative study and introduces a novel dataset of human-written and LLM-generated texts across several genres: essays, stories, poetry, and Python code. We employ several machine learning models to classify these texts. The results show that the models can effectively discern between human-written and AI-generated text, despite the dataset's limited sample size. The task becomes more challenging, however, when classifying GPT-generated text, particularly in story writing. The models also perform better on binary classification tasks, such as distinguishing human-written text from the output of a specific LLM, than on the more complex multiclass tasks that require discerning among human-written text and the outputs of multiple LLMs. Our findings offer useful implications for AI-generated text detection, and our dataset paves the way for future research in this evolving area.