Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of \textbf{H}andwritten \textbf{M}athematical \textbf{E}xpression \textbf{R}ecognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method that integrates limited handwritten formulas with large-scale LaTeX-rendered formulas, built on a scalable data engine that generates complex and consistent LaTeX sequences. With this engine, we construct the largest formula dataset to date, termed \texttt{Tex80M}, comprising over 80 million high-quality training instances. We then propose \texttt{TexTeller}, the first HMER model trained at scale, obtained by mix-training \texttt{Tex80M} with a relatively small HME dataset. The expansive training dataset and our refined pipeline equip \texttt{TexTeller} with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research that builds upon our contributions.