Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for their reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observe miscalibration in their performance and, at the same time, find no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs' miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don't know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but their self-assessment improves with appropriate prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: \href{https://github.com/hfutml/Calibration-MLLM}{https://github.com/hfutml/Calibration-MLLM}.
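For reference, a minimal sketch of temperature scaling in its standard form (the exact variant applied in this work may differ): the model's output logits $z$ are rescaled by a single scalar temperature $T > 0$, which is fitted on a held-out set by minimizing the negative log-likelihood. The symbols $z_i$, $\hat{p}_i$, $T$, and $y_n$ below are illustrative notation, not taken from the paper.
\[
\hat{p}_i(T) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)},
\qquad
T^{*} = \arg\min_{T > 0} \; -\sum_{n} \log \hat{p}_{y_n}(T),
\]
where $z_i$ is the logit for class (or token) $i$ and $y_n$ is the gold label of held-out example $n$. Because a single $T$ rescales all logits uniformly, the model's predicted labels are unchanged while its confidence estimates are recalibrated.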