Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. This raises a natural question: can generative models still solve such problems when the answer must be rendered visually rather than written as text? To study this question, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image (T2I) models show that mathematical fidelity remains a major bottleneck: even the best proprietary model reaches only 42.0% overall accuracy, while open-source models achieve only about 1-11% and are often near 0% on structured tasks. Overall, current T2I models remain far from competent at even elementary mathematical visual generation.
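To make the idea of an executable verifier concrete, the following is a minimal sketch of how per-problem scripts could be aggregated into accuracy figures. It is not MathGen's actual implementation: the `Problem` record, the `score` function, the toy `is_right_triangle` check, and the assumption that each verifier receives a dictionary of facts already extracted from the generated image are all hypothetical. Overall accuracy is assumed here to be the fraction of all problems whose verifier passes.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple


class Problem(NamedTuple):
    """Hypothetical benchmark record: an id, a domain, and a verifier script."""
    problem_id: str
    domain: str
    verifier: Callable[[dict], bool]  # assumed input: facts parsed from an image


def is_right_triangle(facts: dict) -> bool:
    """Toy verifier: pass if the three given vertices form a right triangle."""
    (x1, y1), (x2, y2), (x3, y3) = facts["vertices"]
    # Squared side lengths, smallest to largest.
    a, b, c = sorted([
        (x1 - x2) ** 2 + (y1 - y2) ** 2,
        (x2 - x3) ** 2 + (y2 - y3) ** 2,
        (x3 - x1) ** 2 + (y3 - y1) ** 2,
    ])
    # Pythagorean check with a small relative tolerance; reject degenerate input.
    return c > 0 and abs(a + b - c) <= 1e-6 * c


def score(problems: List[Problem],
          outputs: Dict[str, dict]) -> Tuple[Dict[str, float], float]:
    """Run every verifier once and return (per-domain accuracy, overall accuracy)."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for p in problems:
        total[p.domain] += 1
        ok = False
        if p.problem_id in outputs:
            try:
                ok = bool(p.verifier(outputs[p.problem_id]))
            except Exception:
                ok = False  # a crashing verifier counts as a failure
        passed[p.domain] += int(ok)
    per_domain = {d: passed[d] / total[d] for d in total}
    n = sum(total.values())
    overall = sum(passed.values()) / n if n else 0.0
    return per_domain, overall
```

Because each check is an ordinary deterministic function, re-running `score` on the same outputs yields the same numbers, which is the property the abstract attributes to script-based judging. A problem with no output is counted as failed in this sketch.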