Ensuring that Multimodal Large Language Models (MLLMs) maintain consistency in their responses is essential for developing trustworthy multimodal intelligence. However, existing benchmarks include many samples where all MLLMs \textit{exhibit high response uncertainty when encountering misleading information}, so that even 5--15 response attempts per sample are required to assess uncertainty effectively. We therefore propose a two-stage pipeline: first, we collect MLLMs' responses without misleading information; then, we gather responses elicited by specific misleading instructions. By calculating the misleading rate, and by capturing both correct-to-incorrect and incorrect-to-correct shifts between the two sets of responses, we can effectively quantify a model's response uncertainty. Eventually, we establish a \textbf{\underline{M}}ultimodal \textbf{\underline{U}}ncertainty \textbf{\underline{B}}enchmark (\textbf{MUB}) that employs both explicit and implicit misleading instructions to comprehensively assess the vulnerability of MLLMs across diverse domains. Our experiments reveal that all open-source and closed-source MLLMs are highly susceptible to misleading instructions, with an average misleading rate exceeding 86\%. To enhance the robustness of MLLMs, we further fine-tune all open-source MLLMs by incorporating explicit and implicit misleading data, which significantly reduces their misleading rates. Our code is available at: \href{https://github.com/Yunkai696/MUB}{https://github.com/Yunkai696/MUB}
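The misleading-rate computation described above can be written down as a simple sketch. Note that this formalization is ours, not quoted from the benchmark: the symbols $N$, $n_{c\to i}$, and $n_{i\to c}$ are assumed notation for the sample count and the two kinds of response shifts.

\[
\mathrm{MR} \;=\; \frac{n_{c\to i} + n_{i\to c}}{N},
\]
where $N$ is the number of evaluated samples, $n_{c\to i}$ counts samples answered correctly without misleading instructions but incorrectly once a misleading instruction is applied, and $n_{i\to c}$ counts the reverse shift. A higher $\mathrm{MR}$ thus indicates higher response uncertainty under misleading input.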