Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether the reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional-move construct and accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional-move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
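The core pipeline described above — TF-IDF features over the model's own reasoning text, fed to a supervised classifier that predicts label correctness — can be sketched as follows. This is a minimal illustration only: the reasoning strings, construct names, and hyperparameters are invented placeholders, not the study's actual data or tuned setup.

```python
# Minimal sketch of reasoning-based error detection.
# All data below is illustrative; the study used 30,300 labeled utterances.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Toy LLM reasoning strings; label 1 = the model's construct label was correct.
# Hedged reasoning (might / could / think) marks likely-incorrect predictions,
# mirroring the Tentativeness finding in the abstract.
reasonings = [
    "The teacher asks for an explanation, therefore this is an elicitation move.",
    "Because the utterance gives explicit feedback, it is an evaluation move.",
    "I think this might be revoicing; it could restate the student's idea.",
    "This could possibly be framing, though I think it might be managing.",
] * 5  # duplicate the toy examples so the forest has enough samples to fit
labels = [1, 1, 0, 0] * 5

# TF-IDF encoding of the reasoning text, fed to a Random Forest detector
detector = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
detector.fit(reasonings, labels)

# Score an unseen reasoning string for predicted correctness (0 or 1)
new_reasoning = ["Because the teacher restates the idea, it is therefore revoicing."]
print(detector.predict(new_reasoning)[0])
```

In practice the detector would be trained on held-out reasoning traces with human-verified correctness labels, and evaluated with F1/recall on incorrect predictions as in the abstract.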