A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, but task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts that describe what to look for improve accuracy by sharpening model focus, but physical prompts that explain why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots improves accuracy by up to 13 percentage points. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.
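To make the table-versus-plot comparison concrete, the following is a minimal sketch of how one-dimensional measurements such as a light curve might be serialized as a numerical table for a text prompt. The function name, column labels, number formatting, and subsampling rule are all illustrative assumptions; the abstract does not specify the serialization that AstroVLBench actually uses.

```python
def format_light_curve_table(times, fluxes, flux_errs, max_rows=50):
    """Serialize a light curve as a plain-text table (hypothetical format).

    The column names and fixed-point formatting are assumptions made for
    illustration, not the benchmark's actual prompt layout.
    """
    if not (len(times) == len(fluxes) == len(flux_errs)):
        raise ValueError("all columns must have the same length")

    n = len(times)
    if n > max_rows:
        # Evenly subsample so the table stays within a prompt-size budget.
        step = n / max_rows
        indices = [int(i * step) for i in range(max_rows)]
    else:
        indices = list(range(n))

    lines = ["time_mjd | flux | flux_err"]
    for i in indices:
        lines.append(f"{times[i]:.3f} | {fluxes[i]:.4f} | {flux_errs[i]:.4f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy values only; these are not data from the benchmark.
    print(format_light_curve_table([59000.0, 59001.5], [1.0, 0.98], [0.01, 0.02]))
```

A table like this hands the model the measured values directly, rather than requiring it to read them off axes in a rendered image, which is one plausible reading of why the tabular representation helps.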
