Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, in which models are prompted to check their own labels (self-verification) or to audit one another's labels (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math tutoring sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification, evaluated across all orchestration configurations. Outputs are benchmarked against blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to the unverified baseline, with the largest gains on challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute (1) a flexible orchestration framework instantiating control, self-, and cross-verification conditions; (2) an empirical comparison of frontier LLMs on authentic tutoring data with blinded human "gold" labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), that standardizes reporting and makes directional effects explicit for replication. The results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
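The agreement metric used throughout is Cohen's kappa, which corrects raw label agreement for chance. As a minimal illustrative sketch (not the paper's code; the label names and sequences below are hypothetical), kappa between an LLM annotator's labels and the human "gold" labels can be computed as:

```python
from collections import Counter

def cohens_kappa(gold, pred):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(gold) == len(pred) and len(gold) > 0
    n = len(gold)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected agreement if the two raters labeled independently,
    # using each rater's marginal label frequencies.
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical tutor-move labels for six utterances.
gold = ["hint", "question", "feedback", "hint", "question", "feedback"]
pred = ["hint", "question", "feedback", "question", "question", "feedback"]
print(round(cohens_kappa(gold, pred), 3))  # prints 0.75
```

Under the verifier(annotator) notation, the same function would be applied to each configuration's output, e.g. comparing Gemini(GPT) labels against the adjudicated gold labels.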