Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating Embodied Question Answering (EQA)-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating a physics-based illumination drop and sensor noise, followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals the limitations of VLMs operating under these challenging visual conditions. Project website: https://darkeqa-benchmark.github.io/
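For intuition only, the sketch below shows one way a linear-space low-light degradation of this kind could be implemented. It is a minimal illustration, not the DarkEQA pipeline: the function name `simulate_low_light`, its parameters (`light_factor`, `full_well`, `read_noise_std`, `gamma`), the simplified Poisson-Gaussian noise model, and the single-gamma "ISP" step are all assumptions made for this example.

```python
import numpy as np

def simulate_low_light(srgb, light_factor=0.05, full_well=4000.0,
                       read_noise_std=0.002, gamma=2.2, seed=0):
    """Toy low-light degradation applied in a linear, RAW-like space.

    srgb: float image in [0, 1], assumed gamma-encoded.
    light_factor: fraction of the original scene illumination (hypothetical knob).
    """
    rng = np.random.default_rng(seed)

    # 1) Undo display gamma to approximate linear scene radiance.
    linear = np.clip(srgb, 0.0, 1.0) ** gamma

    # 2) Physics-inspired illumination drop: scale the linear signal.
    dark = linear * light_factor

    # 3) Sensor noise: Poisson shot noise on photoelectron counts plus
    #    Gaussian read noise (a simplified Poisson-Gaussian model).
    electrons = rng.poisson(dark * full_well).astype(np.float64) / full_well
    noisy = electrons + rng.normal(0.0, read_noise_std, size=dark.shape)

    # 4) ISP-inspired rendering: clip and re-apply display gamma without
    #    exposure compensation, so the rendered frame stays dark.
    return np.clip(noisy, 0.0, 1.0) ** (1.0 / gamma)

# Degrade a synthetic frame to 5% of its original illumination.
frame = np.random.rand(64, 64, 3).astype(np.float32)
dark_frame = simulate_low_light(frame, light_factor=0.05)
```

Varying `light_factor` across several values is one way to produce the kind of multi-level degradation the benchmark describes; a full camera model would additionally include white balance, color filter array effects, and a more elaborate ISP.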