Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to leverage long context effectively, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and supervised finetuning on exploratory demonstrations in partially observable or unknown-dynamics settings yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.