We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating it to perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B-parameter footprint, STEP3-VL-10B rivals or surpasses models 10$\times$–20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) as well as top-tier proprietary flagships such as Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.