LaTeX OCR converts images of scientific documents into editable LaTeX code. Existing systems rely on large paired datasets, which are costly to collect and scarce for low-resource languages. This paper presents MIXTEX, a data-efficient system that uses synthetic pretraining and requires no real LaTeX sources. Unlike Nougat, which depends on arXiv datasets, we generate training data by randomly pairing grammatical Wikipedia text with LaTeX formulas, requiring only syntactic correctness. This eliminates the dependency on real document collections, enables scalable data generation (120M tokens), and supports low-resource languages. After synthetic pretraining, adaptation requires only 400 real samples. Evaluation on a 977-sample benchmark of printed and handwritten English and Chinese shows that this two-stage strategy outperforms methods trained on large real datasets while requiring less human effort and computation. Data, code, and models are publicly available.
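The random pairing of prose with formulas might look roughly like the sketch below. It is an illustration under stated assumptions, not the MIXTEX pipeline: the function name `make_synthetic_sample`, the parameters `n_segments` and `formula_prob`, and the tiny in-memory pools are all hypothetical, and the step that renders each LaTeX string to an image to form the (image, LaTeX) training pair is omitted.

```python
import random

# Hypothetical stand-ins for a Wikipedia sentence pool and a LaTeX formula pool.
SENTENCES = [
    "The river flows north through the valley.",
    "Several species were first described in 1901.",
    "The town hosts an annual music festival.",
]
FORMULAS = [
    r"\frac{a}{b}",
    r"\sum_{i=1}^{n} x_i",
    r"e^{i\pi} + 1 = 0",
]


def make_synthetic_sample(sentences, formulas, rng, n_segments=4, formula_prob=0.5):
    """Build one pseudo-document by interleaving prose and inline formulas.

    The text and the formulas are semantically unrelated; the only
    requirement is that the resulting LaTeX is syntactically valid.
    """
    parts = []
    for _ in range(n_segments):
        # Always emit one sentence drawn at random from the prose pool.
        parts.append(rng.choice(sentences))
        # With probability formula_prob, follow it with an inline formula.
        if rng.random() < formula_prob:
            parts.append("$" + rng.choice(formulas) + "$")
    return " ".join(parts)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    print(make_synthetic_sample(SENTENCES, FORMULAS, rng))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which helps when a synthetic corpus has to be regenerated or audited. In a full pipeline, each generated string would be compiled and rasterized to produce the image side of the pair.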