We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including coding and GUI agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. A brief overview is available at https://z.ai/blog/glm-4.6v. Code, models, and more information are released at https://github.com/zai-org/GLM-V.