The automated extraction of structured questions from paper-based mathematics exams is fundamental to intelligent education, yet it remains challenging in real-world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. MathDoc contains \textbf{3,609} carefully curated questions with real-world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on state-of-the-art MLLMs, including Qwen3-VL and Gemini-2.5-Pro, show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at \href{https://github.com/winnk123/papers/tree/master}{GitHub}.