Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B- and 8B-parameter) VLMs. We challenge the prevailing assumption that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse, category-level invariances that suppress the fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder (Penguin-Encoder) is initialized from a text-only LLM. Our experiments show that Penguin-Encoder is a superior alternative to contrastively pretrained encoders, offering higher visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL performs comparably to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation, rather than model scaling, is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving the fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
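
To make the objective mismatch concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss, written with NumPy. It is not taken from the Penguin-VL code; the function name, the temperature value of 0.07, and the toy embeddings are assumptions chosen for illustration. It covers only the softmax variant (SigLIP uses a sigmoid-based pairwise loss instead). The point to notice is that the loss sees a single pooled vector per image, so it rewards telling images apart at the batch level and contains no term that asks the encoder to keep per-patch spatial detail.

```python
import numpy as np

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired *global* embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a pair.
    temperature: illustrative constant (an assumption, not a value from the paper).
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; matching pairs sit on the diagonal.
    logits = img @ txt.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # Negative log-likelihood of the correct (diagonal) match.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy check: four orthogonal "concepts", one pooled vector per image and caption.
aligned_loss = clip_style_contrastive_loss(np.eye(4), np.eye(4))
shuffled_loss = clip_style_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In this toy setup the correctly paired batch yields a loss near zero, while the mispaired batch yields a large loss. Any two images that map to the same pooled vector are indistinguishable to this objective, regardless of how their fine-grained contents differ, which is the kind of invariance the abstract argues is poorly suited to dense captioning and reasoning.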
