Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, in which models are prompted to check their own labels (self-verification) or to audit one another's labels (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math tutoring sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification, evaluated across all orchestration configurations. Outputs are benchmarked against blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to the unverified baseline, with the largest gains on challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute (1) a flexible orchestration framework instantiating control, self-, and cross-verification conditions; (2) an empirical comparison of frontier LLMs on authentic tutoring data with blinded human "gold" labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), that standardizes reporting and makes directional effects explicit for replication. The results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
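The agreement metric used throughout is Cohen's kappa, which corrects raw label agreement for chance. As a minimal illustrative sketch (not the paper's code; the label names and sequences below are hypothetical), kappa between an LLM annotator's labels and the human "gold" labels can be computed as:

```python
from collections import Counter

def cohens_kappa(gold, pred):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(gold) == len(pred) and len(gold) > 0
    n = len(gold)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected agreement if the two raters labeled independently,
    # using each rater's marginal label frequencies.
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical tutor-move labels for six utterances.
gold = ["hint", "question", "feedback", "hint", "question", "feedback"]
pred = ["hint", "question", "feedback", "question", "question", "feedback"]
print(round(cohens_kappa(gold, pred), 3))  # prints 0.75
```

Under the verifier(annotator) notation, the same function would be applied to each configuration's output, e.g. comparing Gemini(GPT) labels against the adjudicated gold labels.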