SemEval-2024 Task 8 focuses on multigenerator, multidomain, and multilingual black-box machine-generated text detection. Such detection is important for preventing potential misuse of large language models (LLMs), the newest of which are highly capable of generating multilingual human-like texts. We approached this task in multiple ways, utilizing language identification and parameter-efficient fine-tuning of smaller LLMs for text classification. We further used per-language classification-threshold calibration to uniquely combine the fine-tuned models' predictions with statistical detection metrics, improving the generalization of the system's detection performance. Our submitted method achieved competitive results, ranking fourth, just under 1 percentage point behind the winner.
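As an illustration of per-language classification-threshold calibration, the following minimal sketch picks, for each language, the decision threshold that maximizes accuracy on held-out data and then applies it to a combined score. It is not the submitted system: the function names (`calibrate_thresholds`, `combine`, `predict`), the weighted-average combination rule, the assumption that the statistical metric is already scaled to [0, 1], and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def combine(p_model, stat_score, weight=0.5):
    """Hypothetical combination rule: weighted average of the fine-tuned
    model's probability and a statistical detection score in [0, 1]."""
    return weight * p_model + (1 - weight) * stat_score


def calibrate_thresholds(scores, labels, languages):
    """For each language, pick the threshold that maximizes accuracy
    on held-out data (label 1 = machine-generated, 0 = human-written)."""
    by_lang = defaultdict(list)
    for score, label, lang in zip(scores, labels, languages):
        by_lang[lang].append((score, label))

    thresholds = {}
    for lang, pairs in by_lang.items():
        best_t, best_acc = 0.5, -1.0
        # Only observed scores need to be tried as candidate thresholds.
        for t in sorted({s for s, _ in pairs}):
            acc = sum((s >= t) == bool(y) for s, y in pairs) / len(pairs)
            if acc > best_acc:  # ties keep the lowest threshold
                best_t, best_acc = t, acc
        thresholds[lang] = best_t
    return thresholds


def predict(score, lang, thresholds, default=0.5):
    """Classify with the language's calibrated threshold; languages not
    seen during calibration fall back to a default threshold."""
    return int(score >= thresholds.get(lang, default))


# Toy held-out data (invented for illustration only).
dev_scores = [0.2, 0.4, 0.6, 0.9, 0.1, 0.3, 0.35, 0.5]
dev_labels = [0, 0, 1, 1, 0, 1, 1, 1]
dev_langs = ["en", "en", "en", "en", "de", "de", "de", "de"]

thresholds = calibrate_thresholds(dev_scores, dev_labels, dev_langs)
```

With these toy values, the same score of 0.4 is labeled machine-generated for `de` but human-written for `en`, which is the effect per-language calibration is meant to capture. The language tag passed to `predict` would come from a language identification step.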