OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet modern vision architectures have strayed from them: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., codecs.

Method. OneVision-Encoder (OV-Encoder) encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OV-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics.

Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, OV-Encoder consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
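To make the idea of patch-level sparsity concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the absolute difference between two consecutive grayscale frames can stand in for a codec's prediction residual, and it keeps only the fraction of patches with the highest residual energy. The function name `select_salient_patches` and the parameters `patch` and `keep_ratio` are hypothetical and chosen for illustration only.

```python
import numpy as np


def select_salient_patches(prev_frame, cur_frame, patch=16, keep_ratio=0.25):
    """Return (row, col) grid indices of the patches with the most residual energy.

    Assumption: the absolute frame difference approximates a codec residual.
    A real codec would supply motion-compensated residuals instead.
    """
    h, w = cur_frame.shape
    gh, gw = h // patch, w // patch  # patch grid size

    # Per-pixel residual, cropped to a whole number of patches.
    residual = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32))
    residual = residual[: gh * patch, : gw * patch]

    # Sum the residual inside each patch to get one energy value per patch.
    energy = residual.reshape(gh, patch, gw, patch).sum(axis=(1, 3))

    # Keep the top-k patches, sorted from highest to lowest energy.
    k = max(1, int(round(keep_ratio * gh * gw)))
    top = np.argsort(energy.ravel())[::-1][:k]
    rows, cols = np.unravel_index(top, (gh, gw))
    return np.stack([rows, cols], axis=1)


# Usage: a static 64x64 background where a single 16x16 block changes.
prev = np.zeros((64, 64), dtype=np.uint8)
cur = prev.copy()
cur[16:32, 32:48] = 255
kept = select_salient_patches(prev, cur, patch=16, keep_ratio=1 / 16)
print(kept)  # grid index of the one patch that changed
```

With a frame index added, each kept patch would carry a (time, row, column) coordinate, which is the kind of irregular token layout a 3D positional scheme has to handle. How OV-Encoder actually derives its patch selection from codec signals is not specified in the abstract.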
