Stan: An LLM-based thermodynamics course assistant

Discussions of AI in education focus predominantly on student-facing tools (chatbots, tutors, and problem generators), while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course. Stan is built on a single data pipeline that we developed and deployed in two roles, serving students and supporting instructors, from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material. The result is a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama 3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design and implementation of the system, and the practical failure modes encountered when deploying models with 7 to 8 billion parameters for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.
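
The student-side retrieval step described above (match technical terms against a textbook index, then ground the answer in chapter and page references) can be illustrated with a minimal sketch. It is not Stan's implementation: the index entries, chapter and page numbers, and the names `IndexEntry`, `match_terms`, and `build_prompt` are all hypothetical, and plain phrase matching stands in for whatever term extraction the real system performs.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    term: str      # index term, lowercase
    chapter: int
    page: int


# Placeholder entries for illustration only; not taken from any real textbook.
TEXTBOOK_INDEX = [
    IndexEntry("fugacity", 7, 212),
    IndexEntry("entropy", 4, 101),
    IndexEntry("gibbs free energy", 6, 178),
]


def normalize(text: str) -> list[str]:
    """Lowercase the text, replace punctuation with spaces, split into tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def match_terms(query: str, index=TEXTBOOK_INDEX) -> list[IndexEntry]:
    """Return index entries whose term appears as a whole phrase in the query."""
    padded = " " + " ".join(normalize(query)) + " "
    return [entry for entry in index if f" {entry.term} " in padded]


def build_prompt(query: str, hits: list[IndexEntry]) -> str:
    """Assemble a grounded prompt that lists the matched references."""
    if hits:
        refs = "\n".join(
            f"- {e.term}: Chapter {e.chapter}, p. {e.page}" for e in hits
        )
    else:
        refs = "(no index matches)"
    return (
        "Answer the question using only the references listed.\n"
        f"References:\n{refs}\n"
        f"Question: {query}\n"
    )


if __name__ == "__main__":
    q = "What is the Gibbs free energy, and how does it relate to entropy?"
    print(build_prompt(q, match_terms(q)))
```

In a full pipeline the assembled prompt would be passed to the locally hosted language model. Listing the references explicitly in the prompt is one simple way to let the generated answer cite specific chapters and pages.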
