We investigate the mathematical capabilities of two iterations of ChatGPT (released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on publicly available datasets, as well as hand-crafted ones, using a novel methodology. In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, either cover only elementary mathematics or are very small. We address this by publicly releasing two new datasets: GHOSTS and miniGHOSTS. These are the first natural-language datasets curated by working researchers in mathematics that (1) aim to cover graduate-level mathematics, (2) provide a holistic overview of the mathematical capabilities of language models, and (3) distinguish multiple dimensions of mathematical reasoning. These datasets also test whether ChatGPT and GPT-4 can be helpful assistants to professional mathematicians by emulating use cases that arise in the daily professional activities of mathematicians. We benchmark the models on a range of fine-grained performance metrics. For advanced mathematics, this is the most detailed evaluation effort to date. We find that ChatGPT can be used most successfully as a mathematical assistant for querying facts, acting as a mathematical search engine and knowledge base interface. GPT-4 can additionally be used for undergraduate-level mathematics but fails on graduate-level difficulty. Contrary to many positive reports in the media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of selection bias), their overall mathematical performance is well below the level of a graduate student. Hence, if your goal is to use ChatGPT to pass a graduate-level math exam, you would be better off copying from your average peer!