Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
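As a rough illustration of the quantities described above, the following sketch computes AEP as the proportion of labels that differ between the first and second pass, along with a simple agreement score for each pass. The abstract does not say which agreement statistic is used or whether AEP is pooled across annotators, so mean pairwise percent agreement and pooling are assumptions here. The function names, annotator IDs, and toy labels are all hypothetical.

```python
from itertools import combinations


def annotator_effort_proxy(first_pass, second_pass):
    """Proportion of labels revised between the two passes (pooled over annotators).

    Both arguments map annotator id -> list of labels, one label per item.
    Pooling across annotators is an assumption, not taken from the source.
    """
    revised = 0
    total = 0
    for annotator, before in first_pass.items():
        after = second_pass[annotator]
        if len(before) != len(after):
            raise ValueError(f"label count mismatch for {annotator}")
        revised += sum(b != a for b, a in zip(before, after))
        total += len(before)
    if total == 0:
        raise ValueError("no labels to compare")
    return revised / total


def mean_pairwise_agreement(labels):
    """Mean fraction of items on which each pair of annotators agrees.

    A stand-in for whichever agreement statistic the study reports.
    """
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    scores = []
    for first, second in pairs:
        x, y = labels[first], labels[second]
        scores.append(sum(a == b for a, b in zip(x, y)) / len(x))
    return sum(scores) / len(scores)


# Hypothetical toy data: three annotators, five sentiment items.
pass_one = {
    "a1": ["pos", "neg", "neu", "pos", "neg"],
    "a2": ["pos", "neg", "pos", "pos", "neg"],
    "a3": ["pos", "pos", "neu", "neg", "neg"],
}
# Second pass, after viewing model reasoning (no predicted labels shown).
pass_two = {
    "a1": ["pos", "neg", "neu", "pos", "neg"],
    "a2": ["pos", "neg", "neu", "pos", "neg"],  # revised item 3
    "a3": ["pos", "neg", "neu", "neg", "neg"],  # revised item 2
}

aep = annotator_effort_proxy(pass_one, pass_two)
agreement_before = mean_pairwise_agreement(pass_one)
agreement_after = mean_pairwise_agreement(pass_two)
print(f"AEP={aep:.3f} before={agreement_before:.3f} after={agreement_after:.3f}")
```

In this toy case only two of fifteen labels change, yet pairwise agreement rises. This mirrors the reported pattern of increased agreement alongside minimal revision, though the numbers are illustrative only.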