SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse Large Language Models (LLMs) in various languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets for training, development, and testing. To tackle this task, we apply two methods: 1) traditional machine learning (ML) with natural language processing (NLP) techniques for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, outperform traditional ML methods, and that majority voting is particularly effective at identifying machine-generated texts in multilingual contexts.
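As an illustration of the majority-voting step mentioned above, the following is a minimal sketch and not the system's actual implementation. The abstract does not state which models are combined or how ties are resolved, so the function name `majority_vote`, the input layout, and the tie-breaking rule (smallest label wins) are all assumptions made for this example.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> list[int]:
    """Combine per-model label predictions by per-sample majority vote.

    predictions[m][i] is the label that model m assigns to sample i.
    Ties go to the smallest label; this rule is an assumption of the
    sketch, not something specified in the abstract.
    """
    if not predictions:
        return []
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    # zip(*predictions) yields, for each sample, the votes of all models.
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Among the labels with the highest count, pick the smallest.
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Hypothetical predictions from three classifiers on three texts
# (0 = human-written, 1 = machine-generated).
model_outputs = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
]
ensemble_labels = majority_vote(model_outputs)
```

With an odd number of binary classifiers, as in the usage example, ties cannot occur; the tie-breaking rule only matters for an even number of voters or for the multi-class setting of Subtask B.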