Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43 and transfers zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting that the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.