We curate a comprehensive dataset of 4,550 questions and solutions from problem sets, midterm exams, and final exams across all MIT Mathematics and Electrical Engineering and Computer Science (EECS) courses required for obtaining a degree. We evaluate the ability of large language models to fulfill the graduation requirements for any MIT major in Mathematics and EECS. Our results demonstrate that GPT-3.5 successfully solves a third of the entire MIT curriculum, while GPT-4, with prompt engineering, achieves a perfect solve rate on a test set excluding questions based on images. We fine-tune an open-source large language model on this dataset. We employ GPT-4 to automatically grade model responses, providing a detailed performance breakdown by course, question, and answer type. By embedding questions in a low-dimensional space, we explore the relationships between questions, topics, and classes and discover which questions and classes are required for solving other questions and classes through few-shot learning. Our analysis offers valuable insights into course prerequisites and curriculum design, highlighting language models' potential for learning and improving Mathematics and EECS education.