Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization. The block decouples spatial token mixing from channel mixing and supports both detection and recognition through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6%, respectively, and surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9$\times$ faster inference than PP-OCRv5_mobile on an Intel Xeon CPU while maintaining comparable accuracy.
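To make the term "structural reparameterization" concrete, the following is a minimal sketch of the general idea, not the PP-OCRv6 implementation. The abstract does not specify which branches the block uses, so the three branches here (a 3-tap convolution, a 1-tap convolution, and an identity shortcut) are assumptions chosen for illustration, as are all function names. The sketch works on a single 1-D channel and omits normalization layers, which in practice would be folded into the kernels before fusion.

```python
def conv1d_same(x, kernel):
    """Zero-padded 'same' 1-D correlation with an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def multi_branch(x, k3, k1):
    """Training-time form (assumed): 3-tap branch + 1-tap branch + identity."""
    y3 = conv1d_same(x, k3)
    y1 = conv1d_same(x, k1)
    return [a + b + c for a, b, c in zip(y3, y1, x)]


def reparameterize(k3, k1):
    """Fold the 1-tap branch and the identity into the centre tap of k3.

    A 1-tap kernel and the identity both act only on the centre position,
    so they are equivalent to a 3-tap kernel that is zero off-centre.
    """
    fused = list(k3)
    fused[1] += k1[0] + 1.0
    return fused


# Hypothetical input signal and branch weights.
x = [1.0, 2.0, -1.0, 0.5, 3.0]
k3 = [0.2, -0.5, 0.1]
k1 = [0.7]

train_out = multi_branch(x, k3, k1)        # three branches, three passes
fused_kernel = reparameterize(k3, k1)      # single equivalent kernel
deploy_out = conv1d_same(x, fused_kernel)  # one branch, one pass

max_diff = max(abs(a - b) for a, b in zip(train_out, deploy_out))
print("max difference between the two forms:", max_diff)
```

Because convolution is linear, the fused kernel reproduces the multi-branch output up to floating-point rounding, which is why a model can train with several branches and deploy with one.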