In this paper, we summarize the results on the Putnam-like benchmark published by Google DeepMind. The dataset consists of 96 original problems in the spirit of the Putnam Competition and 576 solutions generated by LLMs. We analyse the models' performance on these problems to assess their ability to solve mathematical contest problems.