Behavioral analysis of tutoring dialogues is essential for understanding student learning, yet manual coding remains a bottleneck. We present a methodology in which LLM coding agents autonomously improve the prompts used by LLM classifiers to label educational dialogues. In each iteration, a coding agent runs the classifier against human-labeled validation data, analyzes disagreements, and proposes theory-grounded prompt modifications for researcher review. We applied this approach to 659 AI tutoring sessions across four experiments with three agents and three classifiers. Four-fold cross-validation on held-out data confirmed genuine improvement: the best agent achieved test $\kappa=0.78$ (SD$=0.08$), matching human inter-rater reliability ($\kappa=0.78$), at a cost of approximately \$5--8 per agent. While development-set performance reached $\kappa=0.91$--$0.93$, the cross-validated results represent our primary generalization claim. The iterative process also surfaced an undocumented labeling pattern: human coders consistently treated expressions of confusion as engagement rather than disengagement. Continued iteration beyond the optimum led to regression, underscoring the need for held-out validation. We release all prompts, iteration logs, and data.
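For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Cohen's $\kappa$ between two sets of categorical labels, written in plain Python. It is an illustration under stated assumptions, not the evaluation code used in this work: the function name `cohen_kappa` and the label strings `engaged` and `disengaged` are hypothetical and chosen only for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels: a human coder versus an LLM classifier on six turns.
human = ["engaged", "engaged", "disengaged", "engaged", "disengaged", "disengaged"]
model = ["engaged", "engaged", "disengaged", "disengaged", "disengaged", "engaged"]
kappa = cohen_kappa(human, model)  # observed 4/6, chance 1/2, so kappa = 1/3
```

In a cross-validated setting such as the one described, a statistic of this kind would be computed once per held-out fold and then summarized by its mean and standard deviation across folds.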