Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent: VLMs produce plausible Greek substitutions where traditional engines produce local recognition noise. To analyze how visual evidence is used during decoding, we introduce controlled image perturbations and token-level grounding measures that compare image-conditioned and image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful. Token-level analysis shows, however, that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems, but only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
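The abstract does not define its token-level grounding measures, so the following is only a minimal sketch of one way a comparison between image-conditioned and image-free decoding distributions could be computed. The function names, the choice of KL divergence and log-probability ratio, and the toy three-token vocabulary are all assumptions made for illustration, not the measures used in the paper.

```python
import math
from typing import Dict, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must share a vocabulary")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def grounding_scores(
    cond: Sequence[Sequence[float]],   # p(token | image, prefix) at each step
    prior: Sequence[Sequence[float]],  # p(token | prefix) with the image removed
    token_ids: Sequence[int],          # token actually emitted at each step
    eps: float = 1e-12,
) -> List[Dict[str, float]]:
    """Hypothetical per-token grounding scores (an assumed formulation).

    kl        : how far the image moves the whole next-token distribution.
    log_ratio : how much the image raises (or lowers) the emitted token's
                probability relative to image-free decoding.
    Values near zero suggest the token is explained by the language prior.
    """
    scores = []
    for p_img, p_txt, tok in zip(cond, prior, token_ids):
        scores.append({
            "kl": kl_divergence(p_img, p_txt, eps),
            "log_ratio": math.log((p_img[tok] + eps) / (p_txt[tok] + eps)),
        })
    return scores


# Toy example over a 3-token vocabulary (made-up numbers).
cond = [[0.90, 0.05, 0.05], [0.20, 0.70, 0.10]]
prior = [[0.30, 0.40, 0.30], [0.20, 0.70, 0.10]]
scores = grounding_scores(cond, prior, token_ids=[0, 1])
for step, s in enumerate(scores):
    print(step, round(s["kl"], 4), round(s["log_ratio"], 4))
```

In this toy input the first token is strongly supported by the image (both scores are positive), while the second token has identical distributions with and without the image, so both scores are zero. Under this assumed formulation, a fluent but wrong token with near-zero scores would be the kind of prior-driven error the abstract attributes to the OCR-specialist model.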