Automated annotation of pedagogical dialogue is a high-stakes task on which LLMs often fail without sufficient domain grounding. We present a domain-adapted retrieval-augmented generation (RAG) pipeline for tutoring-move annotation. Rather than fine-tuning the generative model, we adapt the retrieval component: we fine-tune a lightweight embedding model on tutoring corpora and index dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's $\kappa$ of 0.526--0.580 on TalkMoves and 0.659--0.743 on Eedi, substantially outperforming no-retrieval baselines ($\kappa = 0.275$--$0.413$ and $0.160$--$0.410$, respectively). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains: under domain-adapted retrieval, top-1 label match rates improve from 39.7\% to 62.0\% on TalkMoves and from 52.9\% to 73.1\% on Eedi. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone, while keeping the generative model frozen, is a practical and effective path toward expert-level pedagogical dialogue annotation.
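To make the retrieval adaptation concrete, the sketch below shows one way such a pipeline could be assembled. It is an illustration under stated assumptions, not our implementation: the fine-tuned checkpoint name tutor-embed-ft, the toy labeled corpus, and the helper names are hypothetical, and sentence-transformers plus FAISS are one plausible choice of tooling.
\begin{verbatim}
# Minimal sketch of utterance-level retrieval for few-shot annotation.
# Assumptions: "tutor-embed-ft" is a hypothetical embedding checkpoint
# fine-tuned on tutoring corpora; the tiny corpus below stands in for
# the full labeled training split.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [  # (utterance, tutoring-move label) pairs, illustrative only
    ("Can you explain why you chose that step?", "Press for Reasoning"),
    ("So you're saying x must equal 4?", "Revoicing"),
]

encoder = SentenceTransformer("tutor-embed-ft")  # hypothetical checkpoint

# Utterance-level indexing: each utterance is embedded and indexed on
# its own, rather than embedding whole dialogues as single documents.
emb = encoder.encode([u for u, _ in corpus], normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(emb, dtype="float32"))

def retrieve_demos(query: str, k: int = 4):
    """Return the k labeled utterances most similar to the query."""
    k = min(k, len(corpus))
    q = encoder.encode([query], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q, dtype="float32"), k)
    return [corpus[i] for i in idx[0]]

def build_prompt(query: str) -> str:
    """Assemble a few-shot annotation prompt from retrieved demos."""
    shots = "\n".join(f"Utterance: {u}\nLabel: {y}"
                      for u, y in retrieve_demos(query))
    return f"{shots}\nUtterance: {query}\nLabel:"
\end{verbatim}
The frozen LLM backbone then completes the assembled prompt with a single move label, so all adaptation lives in the encoder and the index.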
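The headline agreement numbers use Cohen's $\kappa$, which corrects raw label agreement for chance. A minimal sketch of that evaluation step, assuming scikit-learn and purely illustrative label strings, follows.
\begin{verbatim}
# Minimal sketch of the agreement metric: Cohen's kappa between model
# annotations and gold labels (the label strings are illustrative).
from sklearn.metrics import cohen_kappa_score

gold = ["Revoicing", "Press for Reasoning", "Revoicing", "Restating"]
pred = ["Revoicing", "Press for Reasoning", "Restating", "Restating"]

kappa = cohen_kappa_score(gold, pred)  # chance-corrected, in [-1, 1]
print(f"Cohen's kappa = {kappa:.3f}")
\end{verbatim}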