We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating it to perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B-parameter footprint, STEP3-VL-10B rivals or surpasses models 10$\times$–20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) as well as top-tier proprietary flagships such as Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.