Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.