Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains underexplored. We introduce \textbf{ReasonScaffold}, a scaffolded reasoning annotation protocol that exposes LLM-generated explanations while withholding predicted labels. We study how reasoning affects human annotation behavior in a controlled setting, rather than evaluating annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement, along with minimal revision, suggesting that reasoning helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for human--AI co-annotation workflows.
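As a rough illustration of the two quantities analyzed above, the sketch below computes an AEP-style revision rate and a simple agreement score for a two-pass annotation round. It assumes AEP is the fraction of first-pass labels that differ in the second pass, pooled over annotators, which follows the one-line description here but may differ from the exact definition used in the paper. The agreement measure (mean pairwise percent agreement), the function names, and the toy labels are all hypothetical stand-ins, not the study's data or metric.

```python
from itertools import combinations


def annotator_effort_proxy(first_pass, second_pass):
    """Fraction of labels that changed between the first and second pass.

    Assumed reading of AEP: revised labels / total labels.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same instances")
    if not first_pass:
        raise ValueError("no labels given")
    revised = sum(a != b for a, b in zip(first_pass, second_pass))
    return revised / len(first_pass)


def pairwise_agreement(labels_by_annotator):
    """Mean fraction of instances on which each pair of annotators agrees.

    A simple stand-in for an inter-annotator agreement statistic; it does
    not correct for chance agreement.
    """
    pairs = list(combinations(labels_by_annotator, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    per_pair = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in pairs
    ]
    return sum(per_pair) / len(per_pair)


# Hypothetical toy data: 3 annotators x 4 instances, before and after
# viewing model-generated reasoning (predicted labels withheld).
pass1 = [
    ["pos", "neg", "pos", "neg"],
    ["pos", "neg", "neg", "neg"],
    ["pos", "pos", "pos", "neg"],
]
pass2 = [
    ["pos", "neg", "pos", "neg"],
    ["pos", "neg", "pos", "neg"],
    ["pos", "neg", "pos", "neg"],
]

# Pool all annotators' labels to get one revision rate for the round.
flat1 = [label for annotator in pass1 for label in annotator]
flat2 = [label for annotator in pass2 for label in annotator]

aep = annotator_effort_proxy(flat1, flat2)
agreement_before = pairwise_agreement(pass1)
agreement_after = pairwise_agreement(pass2)
```

In this toy round, two of twelve labels are revised while pairwise agreement rises, which mirrors the pattern described above: higher agreement with few revisions. A chance-corrected statistic could replace `pairwise_agreement` without changing the rest of the sketch.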