We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when feeding them into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens prior to LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Code and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.
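To make the contrast concrete, the sketch below compares conventional raster-scan flattening with a semantics-driven token reordering. This is a hypothetical toy illustration, not the released DeepEncoder V2 code: the `score_fn` stands in for whatever learned causal module decides the order, and the grid of patch embeddings is synthetic.

```python
import numpy as np

def raster_order(grid_tokens):
    """Conventional VLM flattening: fixed top-left to bottom-right order."""
    h, w, d = grid_tokens.shape
    return grid_tokens.reshape(h * w, d)

def semantic_reorder(grid_tokens, score_fn):
    """Hypothetical reordering: sort tokens by a relevance score so the
    LLM consumes them in a semantically coherent causal order.
    `score_fn` is a placeholder for a learned scoring module."""
    h, w, d = grid_tokens.shape
    flat = grid_tokens.reshape(h * w, d)
    scores = np.array([score_fn(t) for t in flat])
    order = np.argsort(-scores)  # highest-scoring token first
    return flat[order], order

# Example: a 2x2 grid of 3-dimensional patch embeddings.
grid = np.arange(12, dtype=float).reshape(2, 2, 3)
tokens, order = semantic_reorder(grid, score_fn=lambda t: t.sum())
print(order)  # -> [3 2 1 0]: tokens sorted by descending embedding sum
```

The point of the sketch is only the interface: both paths emit a 1D token sequence for the LLM, but the second makes the ordering itself a function of content rather than of pixel position.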