We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as well as hand-crafted ones, and measuring its performance against other models trained on a mathematical corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional mathematicians by emulating various use cases that come up in the daily professional activities of mathematicians (question answering, theorem searching). In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, only cover elementary mathematics. We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new dataset publicly available to assist a community-driven comparison of ChatGPT with (future) large language models in terms of advanced mathematical comprehension. We conclude that contrary to many positive reports in the media (a potential case of selection bias), ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Our results show that ChatGPT often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to pass a university exam, you would be better off copying from your average peer!