Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification

"Compression Tells Intelligence", is supported by research in artificial intelligence, particularly concerning (multimodal) large language models (LLMs/MLLMs), where compression efficiency often correlates with improved model performance and capabilities. For compression, classical visual coding based on traditional information theory has developed over decades, achieving great success with numerous international industrial standards widely applied in multimedia (e.g., image/video) systems. Except that, the recent emergingvisual token technology of generative multi-modal large models also shares a similar fundamental objective like visual coding: maximizing semantic information fidelity during the representation learning while minimizing computational cost. Therefore, this paper provides a comprehensive overview of two dominant technique families first -- Visual Coding and Vision Token Technology -- then we further unify them from the aspect of optimization, discussing the essence of compression efficiency and model performance trade-off behind. Next, based on the proposed unified formulation bridging visual coding andvisual token technology, we synthesize bidirectional insights of themselves and forecast the next-gen visual codec and token techniques. Last but not least, we experimentally show a large potential of the task-oriented token developments in the more practical tasks like multimodal LLMs (MLLMs), AI-generated content (AIGC), and embodied AI, as well as shedding light on the future possibility of standardizing a general token technology like the traditional codecs (e.g., H.264/265) with high efficiency for a wide range of intelligent tasks in a unified and effective manner.
