Since the introduction of the original BERT (i.e., BASE BERT), researchers have used transfer learning to develop customized BERT models with improved performance on specific domains and tasks. Because mathematical texts often combine domain-specific vocabulary with equations and math symbols, we posit that a new BERT model for mathematics would be useful for many mathematical downstream tasks. In this resource paper, we introduce our multi-institutional effort (i.e., two learning platforms and three academic institutions in the US) toward this need: MathBERT, a model created by pre-training the BASE BERT model on a large mathematical corpus ranging from pre-kindergarten (pre-k), through high school, to graduate-level college mathematics. To demonstrate the superiority of MathBERT over BASE BERT, we select three general NLP tasks that are often used in mathematics education: knowledge component prediction, auto-grading of open-ended Q&A, and knowledge tracing. Our experiments show that MathBERT outperforms prior best methods by 1.2-22% and BASE BERT by 2-8% on these tasks. In addition, we build a mathematics-specific vocabulary, 'mathVocab', for training MathBERT. We find that MathBERT pre-trained with 'mathVocab' outperforms MathBERT trained with the BASE BERT vocabulary (i.e., 'origVocab'). MathBERT is currently being adopted at the participating learning platforms: Stride, Inc., a commercial educational resource provider, and ASSISTments.org, a free online educational platform. We release MathBERT for public use at: https://github.com/tbs17/MathBERT.
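One way to build intuition for why a mathematics-specific vocabulary might help is to look at how a WordPiece-style tokenizer splits domain terms. The sketch below is illustrative only: it implements greedy longest-match-first splitting in plain Python, and the two toy vocabularies (`general_vocab`, `math_vocab`) are hypothetical stand-ins, not the actual 'origVocab' or 'mathVocab'. The function name `wordpiece_tokenize` is likewise an assumption made for this example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions, not the released vocab files).
general_vocab = {"poly", "##no", "##mial", "hyper", "##bola", "add"}
math_vocab = general_vocab | {"polynomial", "hyperbola"}

for word in ["polynomial", "hyperbola", "add"]:
    print(word,
          wordpiece_tokenize(word, general_vocab),
          wordpiece_tokenize(word, math_vocab))
```

With the general toy vocabulary, "polynomial" is split into three sub-word pieces, whereas the math toy vocabulary keeps it as a single token. Whether fewer fragments translate into better downstream accuracy is an empirical question, which the reported comparison between 'mathVocab' and 'origVocab' addresses.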