Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness. We present LREAD, a Korean-specific instantiation of a rubric-based expert-calibration framework for human attribution of LLM-generated text. In a three-phase blind longitudinal study with three linguistically trained annotators, Phase 1 measures intuition-only attribution, Phase 2 introduces criterion-anchored scoring with explicit justifications, and Phase 3 evaluates a limited held-out elementary-persona subset. Majority-vote accuracy improves from 0.60 in Phase 1 to 0.90 in Phase 2, and reaches 10/10 on the limited Phase 3 subset (95% CI [0.692, 1.000]); agreement also increases from Fleiss' $\kappa = -0.09$ to $\kappa = 0.82$. Error analysis suggests that calibration primarily reduces false negatives on AI essays rather than inducing generalized over-detection. We position LREAD as pilot evidence for within-panel calibration in a Korean argumentative-essay setting. These findings suggest that rubric-scaffolded human judgment can complement automated detectors by making attribution reasoning explicit, auditable, and adaptable.
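As a rough illustration of the two statistics reported above, the following sketch computes an exact binomial confidence interval and Fleiss' $\kappa$ using only the Python standard library. The abstract does not state which interval method was used; the sketch assumes an exact (Clopper-Pearson) interval, which appears to reproduce the reported lower bound for 10/10. The function names and the rating tables are hypothetical and are not data from the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided CI for a proportion (assumed method, see lead-in)."""

    def solve(k, target):
        # binom_cdf(k, n, p) decreases as p grows, so bisect on p in [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(x - 1, 1 - alpha / 2)
    upper = 1.0 if x == n else solve(x, alpha / 2)
    return lower, upper


def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item observed agreement among rater pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# 10 correct majority votes out of 10, as in the Phase 3 subset.
low, high = clopper_pearson(10, 10)
print(f"95% CI: [{low:.3f}, {high:.3f}]")

# Hypothetical toy ratings (3 raters, categories: human / AI).
split_votes = [[2, 1], [1, 2], [2, 1], [1, 2]]
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(split_votes), fleiss_kappa(unanimous))
```

With ten items, even a perfect score leaves a lower bound near 0.69, which is why the Phase 3 result is described as limited. The toy tables show how $\kappa$ can fall below zero when raters split about as often as chance would predict, and reach 1 when they are unanimous.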