In academic and professional settings such as mathematics lectures and research presentations, mathematical expressions often have to be conveyed orally. Reading them aloud without accompanying visuals can significantly hinder comprehension, especially for listeners who are hearing-impaired or who rely on subtitles because of language barriers. For instance, when a presenter reads Euler's formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., "e to the power of i x equals cosine of x plus i $\textit{side}$ of x") instead of the concise $\LaTeX{}$ form (i.e., $ e^{ix} = \cos(x) + i\sin(x) $), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured $\LaTeX{}$ representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates $\LaTeX{}$ generation capabilities comparable to leading commercial Large Language Models (LLMs), while using fine-tuned small language models of only 120M parameters. Specifically, on the CER, BLEU, and ROUGE metrics for $\LaTeX{}$ translation, MathSpeech significantly outperforms GPT-4o: CER decreases from 0.390 to 0.298, and the ROUGE and BLEU scores are higher than those of GPT-4o.
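For readers unfamiliar with the CER (character error rate) metric cited above, the following is a minimal sketch of how it is commonly computed: the character-level Levenshtein edit distance between a reference string and a hypothesis, divided by the reference length. The function names `levenshtein` and `cer` are hypothetical, and the exact normalization used in the MathSpeech evaluation (for example, whitespace handling in the $\LaTeX{}$ strings) is not stated here, so this should be read as an illustration under those assumptions rather than the authors' evaluation code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp; start with the empty-ref row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # drop a character of ref
                curr[j - 1] + 1,     # drop a character of hyp
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length.
    Assumption: no whitespace or case normalization is applied."""
    if not ref:
        raise ValueError("reference string must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = r"e^{ix} = \cos(x) + i\sin(x)"
    hypothesis = r"e^{ix} = \cos(x) + i\side(x)"  # mimics the 'side' ASR error
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Lower is better: a CER of 0 means the generated string matches the reference exactly, and the value can exceed 1 when the hypothesis is much longer than the reference.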