GLM-OCR is a compact 0.9B-parameter multimodal model designed for efficient real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, striking a strong balance between computational cost and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through parameter sharing. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, and the detected regions are then recognized in parallel. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
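The throughput gain from Multi-Token Prediction can be illustrated with a minimal sketch: instead of one forward pass per token, each decoding step proposes a block of K tokens. The model internals below (`toy_predict_k`, `K`) are hypothetical stand-ins, not GLM-OCR's actual heads or acceptance logic.

```python
K = 4  # hypothetical number of tokens predicted per decoding step

def toy_predict_k(prefix, k=K):
    """Stand-in for the model: deterministically proposes the next k
    token ids from the current prefix (here, a simple counter)."""
    start = len(prefix)
    return [start + i for i in range(k)]

def mtp_decode(max_len=16):
    """Greedy MTP-style decoding: each step appends K tokens instead
    of 1, cutting the number of sequential decoder invocations by ~K."""
    tokens, steps = [], 0
    while len(tokens) < max_len:
        tokens.extend(toy_predict_k(tokens))
        steps += 1
    return tokens[:max_len], steps

out, steps = mtp_decode()
print(steps)  # 4 steps for 16 tokens, vs. 16 with one-token-per-step decoding
```

Because OCR output is largely deterministic given the image, multi-token proposals rarely need correction, which is why the speedup holds in practice for this task class.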