Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and to high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle to detect sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring the entropy of agreement across multiple models' outputs. The core insight is that correct predictions converge in output space, while errors diverge. Building on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs through ensemble agreement, selects the best output, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving the F1 score by 42.1\% over a VLM-as-Judge baseline. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration.
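To make the convergence intuition concrete, the sketch below shows one way a consensus-style agreement score could be computed over the text outputs of several OCR models, with low divergence signaling likely-correct samples and high divergence flagging samples for routing to a stronger model. It is a minimal illustration under stated assumptions, not the paper's released implementation: the normalized Levenshtein divergence, the medoid-style selection rule, the threshold value, and all function names are illustrative choices.

```python
# Minimal sketch of a consensus-style agreement score for OCR outputs.
# Assumptions (illustrative, not from the paper's code): divergence is a
# normalized Levenshtein distance, aggregated as the mean over all pairs.

from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pairwise_divergence(x: str, y: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means identical outputs."""
    if not x and not y:
        return 0.0
    return levenshtein(x, y) / max(len(x), len(y))

def consensus_score(outputs: list[str]) -> float:
    """Mean pairwise divergence across all model outputs.
    Low score -> models converge (sample likely correct);
    high score -> models diverge (sample likely erroneous)."""
    if len(outputs) < 2:
        return 0.0
    pairs = list(combinations(outputs, 2))
    return sum(pairwise_divergence(x, y) for x, y in pairs) / len(pairs)

def select_best(outputs: list[str]) -> str:
    """Pick the output closest on average to all others (the medoid)."""
    return min(outputs, key=lambda o: sum(pairwise_divergence(o, p)
                                          for p in outputs))

# Usage: verify by agreement, then either accept the medoid output or
# route the sample to a stronger (more expensive) model.
ocr_outputs = ["Invoice #1234", "Invoice #1234", "lnvoice #1284"]
score = consensus_score(ocr_outputs)
if score > 0.15:  # threshold is an illustrative hyperparameter
    print(f"low consensus ({score:.3f}): route to stronger model")
print("selected output:", select_best(ocr_outputs))
```

In this sketch, the same divergence matrix serves all three roles named in the abstract: verification (threshold on the mean divergence), selection (medoid output), and adaptive routing (escalate only low-consensus samples), which is what keeps the approach training-free and plug-and-play.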