We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLMs) on their ability to perform textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. It employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack temporal understanding, and are unable to extract quantitative information from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find that no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state of the art in understanding periodic patterns.