"How Do I ...?": Procedural Questions Predominate Student-LLM Chatbot Conversations

Providing scaffolding through educational chatbots built on Large Language Models (LLMs) has potential risks and benefits that remain an open area of research. When students navigate impasses, they ask for help by formulating impasse-driven questions. Within interactions with LLM chatbots, such questions shape the user prompts and drive the pedagogical effectiveness of the chatbot's response. This paper focuses on such student questions from two datasets of distinct learning contexts: formative self-study and summative assessed coursework. We analysed 6,113 messages from both learning contexts, with 11 different LLMs and three human raters classifying student questions according to four existing schemas. On the feasibility of using LLMs as raters, results showed moderate-to-good inter-rater reliability, with higher consistency than human raters. The data showed that 'procedural' questions predominated in both learning contexts, but more so when students prepare for summative assessment. These results provide a basis on which to use LLMs for classification of student questions. However, we identify clear limitations in both the ability to classify with schemas and the value of doing so: schemas are limited and thus struggle to accommodate the semantic richness of composite prompts, offering only a partial understanding of the wider risks and benefits of chatbot integration. In the future, we recommend an analysis approach that captures the nuanced, multi-turn nature of conversation, for example, by applying methods from conversation analysis in discursive psychology.
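To make the notion of inter-rater reliability concrete, the following is a minimal sketch of how agreement among several raters (LLM or human) could be quantified. The abstract does not name the statistic used; Fleiss' kappa is assumed here as one common choice for more than two raters. The function name `fleiss_kappa`, the labels other than 'procedural', and the sample data are all hypothetical and not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a list of lists: ratings[i] holds the category label that each
    rater assigned to message i. (Assumed statistic; the paper may use another.)
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Per-item counts of how many raters chose each category.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_chance = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_chance == 1.0:
        # Only one category was ever used, so kappa is undefined; report 1.0.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical labels from three raters for four student messages.
sample = [
    ["procedural", "procedural", "procedural"],
    ["procedural", "procedural", "conceptual"],
    ["conceptual", "conceptual", "conceptual"],
    ["procedural", "other", "procedural"],
]
print(f"Fleiss' kappa: {fleiss_kappa(sample):.3f}")
```

Under this sketch, the same function could be applied separately to the set of LLM raters and the set of human raters to compare their internal consistency, which is the kind of comparison the abstract reports.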
