Quantum computing calibration depends on interpreting experimental data, and calibration plots are the most universal human-readable representation of that data, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose model reaches a zero-shot mean score of 72.3. Many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning (SFT) ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches a zero-shot average score of 74.7.