From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text

Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness. We present LREAD, a Korean-specific instantiation of a rubric-based expert-calibration framework for human attribution of LLM-generated text. In a three-phase blind longitudinal study with three linguistically trained annotators, Phase 1 measures intuition-only attribution, Phase 2 introduces criterion-anchored scoring with explicit justifications, and Phase 3 evaluates a limited held-out elementary-persona subset. Majority-vote accuracy improves from 0.60 in Phase 1 to 0.90 in Phase 2, and reaches 10/10 on the limited Phase 3 subset (95% CI [0.692, 1.000]); agreement also increases from Fleiss' $κ$ = -0.09 to 0.82. Error analysis suggests that calibration primarily reduces false negatives on AI essays rather than inducing generalized over-detection. We position LREAD as pilot evidence for within-panel calibration in a Korean argumentative-essay setting. These findings suggest that rubric-scaffolded human judgment can complement automated detectors by making attribution reasoning explicit, auditable, and adaptable.
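The two statistics quoted above can be checked with a short sketch. The abstract does not say how the interval was computed; the reported [0.692, 1.000] for 10/10 is consistent with an exact (Clopper-Pearson) binomial interval, so that method is assumed here. The function names and the toy rating table are illustrative only and are not the study's data or code.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a decreasing function f on [lo, hi], found by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial interval for k successes in n trials.

    Assumption: this is the interval type behind the reported CI;
    the abstract itself does not name the method.
    """
    # Lower bound: p at which P(X >= k) equals alpha/2 (0 when k == 0).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound: p at which P(X <= k) equals alpha/2 (1 when k == n).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (same rater count per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fell in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    # Per-item observed agreement among rater pairs.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    low, high = clopper_pearson(10, 10)
    print(f"10/10 exact 95% CI: [{low:.3f}, {high:.3f}]")

    # Hypothetical toy tables: 3 raters, 4 essays, columns = [human, AI].
    unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
    split = [[2, 1], [1, 2], [2, 1], [1, 2]]
    print("kappa (unanimous):", fleiss_kappa(unanimous))
    print("kappa (split 2-1):", fleiss_kappa(split))
```

For 10/10 the lower bound reduces to 0.025^(1/10), which rounds to 0.692 and matches the reported interval. The split toy table gives a kappa below zero, which illustrates how three raters can agree less often than chance would predict, as in the Phase 1 value of -0.09.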

