We curate a comprehensive dataset of 4,550 questions and solutions from problem sets, midterm exams, and final exams across all MIT Mathematics and Electrical Engineering and Computer Science (EECS) courses required for obtaining a degree. We evaluate the ability of large language models to fulfill the graduation requirements for any MIT major in Mathematics and EECS. Our results show that GPT-3.5 solves a third of the questions in this curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate on a test set that excludes image-based questions. We fine-tune an open-source large language model on this dataset. We employ GPT-4 to automatically grade model responses, providing a detailed performance breakdown by course, question type, and answer type. By embedding questions in a low-dimensional space, we explore the relationships between questions, topics, and classes, and we use few-shot learning to discover which questions and classes are required for solving other questions and classes. Our analysis offers insights into course prerequisites and curriculum design, highlighting the potential of language models for learning and improving Mathematics and EECS education.
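The abstract mentions embedding questions and using few-shot learning to relate them. One plausible reading is sketched below: rank already-solved questions by their similarity to a new question and place the closest ones in the prompt as worked examples. This is only an illustration under stated assumptions. The abstract does not name a similarity measure or an embedding model, so cosine similarity, the hand-written three-dimensional vectors, and the names `cosine_similarity`, `select_few_shot`, `build_prompt`, and `SOLVED` are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_few_shot(query_vec, solved, k=2):
    """Return the k solved items whose embeddings are most similar to query_vec."""
    ranked = sorted(
        solved,
        key=lambda item: cosine_similarity(query_vec, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, examples):
    """Concatenate worked examples, then the new question with an open solution slot."""
    parts = [
        f"Question: {ex['question']}\nSolution: {ex['solution']}\n"
        for ex in examples
    ]
    parts.append(f"Question: {question}\nSolution:")
    return "\n".join(parts)


# Toy data: embeddings are made up by hand, not produced by any real model.
SOLVED = [
    {"question": "Differentiate x^2.", "solution": "2x",
     "embedding": [0.9, 0.1, 0.0]},
    {"question": "Integrate 2x.", "solution": "x^2 + C",
     "embedding": [0.8, 0.2, 0.1]},
    {"question": "What is the time complexity of binary search?",
     "solution": "O(log n)", "embedding": [0.0, 0.1, 0.9]},
]

if __name__ == "__main__":
    # Hypothetical embedding of a new calculus question.
    query_vec = [1.0, 0.0, 0.0]
    examples = select_few_shot(query_vec, SOLVED, k=2)
    print(build_prompt("Differentiate x^3.", examples))
```

In this sketch, a question that is solved only when examples from another class appear in the prompt would hint at a dependency between the two classes, which is the kind of relationship the abstract describes.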