For middle-school math students, interactive question-answering (QA) with tutors is an effective way to learn. The flexibility and emergent capabilities of generative large language models (LLMs) have led to a surge of interest in automating portions of the tutoring process, including interactive QA to support conceptual discussion of mathematics. However, LLM responses to math questions can be incorrect or mismatched to the educational context, for example by being misaligned with a school's curriculum. One potential solution is retrieval-augmented generation (RAG), which involves incorporating a vetted external knowledge source in the LLM prompt to increase response quality. In this paper, we design prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions. We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey, finding that humans prefer responses generated using RAG, but not when responses are too grounded in the textbook content. We argue that while RAG can improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources.