Large Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI through their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper evaluates the ability of LLMs (GPT4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT4V and Gemini Pro Vision) to estimate their own uncertainty in verbalized form via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE), a metric that measures the direction of miscalibration. Results show that both LLMs and VLMs have a high calibration error and are overconfident most of the time, indicating a poor capability for uncertainty estimation. Additionally, we develop prompts for regression tasks and show that VLMs are poorly calibrated when producing mean/standard deviation estimates and 95% confidence intervals.
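To make the distinction between the magnitude and the direction of miscalibration concrete, the following is a minimal sketch, not the paper's implementation. It assumes that NCE is the signed counterpart of a binned expected calibration error (ECE), that is, the bin-weighted sum of accuracy minus mean confidence. The abstract does not give the formula, so this definition, the function name `calibration_errors`, and the choice of ten equal-width bins are all assumptions made for illustration.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ece, nce) for verbalized confidences in [0, 1].

    Assumed definitions (not taken from the abstract):
      ece = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)|
      nce = sum_b (|B_b| / N) * (acc(B_b) - conf(B_b))
    Under this sign convention, a negative nce means overconfidence.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    nce = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        gap = accuracy - avg_conf
        weight = len(members) / n
        ece += weight * abs(gap)  # magnitude only
        nce += weight * gap       # keeps the sign
    return ece, nce


# Hypothetical data: a model that states 90% confidence but is right 1 time in 4.
ece, nce = calibration_errors([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
```

Under these assumptions, an overconfident model yields a negative signed value whose magnitude matches the unsigned error, while a model that is overconfident in some bins and underconfident in others can have a signed value near zero despite a large unsigned error. This is why the two quantities are best read together.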