Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
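The abstract does not state which metric is used to score transcriptions, but OCR quality is conventionally measured with character error rate (CER): the edit distance between a model's output and the reference text, normalized by reference length. A minimal sketch of such a scorer follows; the function names and the NFC-normalization step are illustrative assumptions, not the paper's documented protocol (normalization matters for scripts where the same glyph has composed and decomposed Unicode encodings):

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance via the classic two-row DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,            # deletion
                curr[j - 1] + 1,        # insertion
                prev[j - 1] + (r != h)  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length.

    NFC normalization is an assumption here: it makes composed and
    decomposed codepoint sequences (common in Indic and other complex
    scripts) compare equal before scoring.
    """
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `cer("hello", "hallo")` yields 0.2 (one substitution over five reference characters). A hallucinated transcription in the wrong script, as the abstract describes, would score near or above 1.0, since nearly every reference character requires an edit.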