In educational applications, LLMs exhibit several fundamental pedagogical limitations, such as a tendency to reveal solutions rather than support dialogic learning. We introduce ConvoLearn (https://huggingface.co/datasets/masharma/convolearn), a dataset grounded in knowledge-building theory that operationalizes six core pedagogical dimensions: cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. We construct a semi-synthetic dataset of 1,250 tutor-student dialogues (20 turns each) in middle school Earth Science through controlled interactions between human teachers and a simulated student. Using QLoRA fine-tuning, we demonstrate that training on this dataset meaningfully shifts LLM behavior toward knowledge-building strategies. Human evaluation by 31 teachers shows that our fine-tuned Mistral 7B (M = 4.10, SD = 1.03) significantly outperforms both its base version (M = 2.59, SD = 1.11) and Claude Sonnet 4.5 (M = 2.87, SD = 1.29) overall. This work establishes a potential framework to guide the future development and evaluation of constructivist AI tutors.