Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task in which the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential: they introduce structural instabilities that are benign in flexible tasks such as captioning but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to employ block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference than autoregressive baselines.
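To make the decoding-cost argument concrete, the toy sketch below contrasts the number of forward passes needed by token-by-token autoregressive decoding with block-wise masked-diffusion decoding. It is a minimal illustration of the general scheme only, not DODO's actual implementation: the "model" is a stand-in oracle (OCR is deterministic, so the target uniquely fixes the output), and the block size and number of tokens unmasked per step are illustrative assumptions.

```python
# Hedged sketch: autoregressive vs. block-wise masked-diffusion decoding.
# The denoiser is mocked by an oracle; all hyperparameters are illustrative.

TARGET = "HELLO WORLD, THIS IS OCR"
MASK = "_"

def oracle_predict(seq):
    """Stand-in for a trained denoiser: fills every masked position
    with the correct character (OCR's deterministic setting)."""
    return [t if s == MASK else s for s, t in zip(seq, TARGET)]

def autoregressive_decode(length):
    """One sequential forward pass per generated token."""
    out, steps = [], 0
    for i in range(length):
        steps += 1                 # forward pass for this single token
        out.append(TARGET[i])      # oracle next-token prediction
    return "".join(out), steps

def block_diffusion_decode(length, block_size=8, unmask_per_step=4):
    """Blocks are decoded left to right; within a block, several masked
    positions are committed in parallel at each denoising step."""
    seq, steps = [MASK] * length, 0
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        while MASK in seq[start:end]:
            steps += 1             # one parallel denoising pass
            pred = oracle_predict(seq)
            masked = [i for i in range(start, end) if seq[i] == MASK]
            for i in masked[:unmask_per_step]:
                seq[i] = pred[i]   # commit a few positions at a time
    return "".join(seq), steps

ar_text, ar_steps = autoregressive_decode(len(TARGET))
bd_text, bd_steps = block_diffusion_decode(len(TARGET))
print(ar_steps, bd_steps)  # diffusion needs far fewer forward passes
```

With these toy settings, both decoders recover the same exact-match string, but the block-diffusion loop needs only a few denoising passes per block rather than one pass per token, which is the source of the speedup the abstract describes. Decoding blocks left to right while unmasking within a block mirrors how block decomposition limits how far a synchronization error can propagate compared with unmasking over the whole sequence at once.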