DeepSeek-OCR uses an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming the ability to decode a number of text tokens exceeding ten times the number of input visual tokens. While this suggests a promising remedy for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By applying sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capability from its language priors. Results show that without linguistic support, DeepSeek-OCR's accuracy plummets from approximately 90% to about 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods are markedly more robust to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with heavier reliance on language priors, exacerbating hallucination risks. Context stress testing also reveals a complete model collapse at around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically delineates DeepSeek-OCR's capability boundaries and offers essential insights for future optimization of the vision-text compression paradigm. We release all data, results, and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
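As a rough illustration of the probing setup, the sketch below is a hypothetical, minimal word-level corruption routine, not the authors' released scripts (those are in the repository linked above). It replaces every letter and digit with a random character of the same class, so a rendered page keeps its word lengths, punctuation, and layout while carrying no linguistic signal; a model that still transcribes such a page accurately must rely on vision rather than language priors. The helper names `corrupt_word` and `corrupt_text`, and the mapping of "sentence-level" corruption to word-order shuffling, are assumptions made for this example.

```python
# Hypothetical sketch of word-level semantic corruption for OCR probing.
# Every letter/digit is swapped for a random character of the same class,
# so the text keeps its visual shape but loses all linguistic meaning.
import random
import string


def corrupt_word(word: str) -> str:
    """Replace each alphanumeric character with a random one of the same class."""
    out = []
    for ch in word:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(random.choice(pool))
        else:
            out.append(ch)  # keep punctuation so layout is preserved
    return "".join(out)


def corrupt_text(text: str, level: str = "word") -> str:
    """'word' corrupts each word in place; 'sentence' additionally shuffles word order."""
    words = [corrupt_word(w) for w in text.split(" ")]
    if level == "sentence":
        random.shuffle(words)
    return " ".join(words)


if __name__ == "__main__":
    random.seed(0)
    sample = "DeepSeek-OCR decodes text tokens from compressed visual tokens."
    # Prints a line with the same word lengths and punctuation but no meaning.
    print(corrupt_text(sample, level="word"))
```

Rendering the corrupted string to an image (for example with Pillow) would then yield a test page on which transcription accuracy reflects visual recognition alone.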