Accurately assessing the trustworthiness of predictions generated by multimodal large language models (MLLMs), which can enable selective prediction and improve user confidence, is challenging due to the diversity of multimodal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs that derives an uncertainty measure from equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA requires only input-output access to the model (black-box) and no ground-truth labels (unsupervised). Experiments are conducted with various off-the-shelf multimodal LLMs on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves a significant improvement in selective prediction performance (33.3% relative improvement for vision-LLMs and 29.6% for audio-LLMs), measured by the area under the receiver operating characteristic curve (AUROC) for detecting mispredictions. The code implementation is open-sourced.
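The black-box scheme described above can be illustrated with a minimal toy sketch. This is not the paper's actual implementation: the `model` callable, the way equivalent and complementary samples are obtained, and the way consistency and sensitivity are combined into one score are all simplifying assumptions made here for illustration. The idea is that a trustworthy prediction should be stable under task-preserving (equivalent) perturbations and should change under task-altering (complementary) ones, and that the resulting uncertainty score can be evaluated for misprediction detection with AUROC.

```python
def festa_style_uncertainty(model, equivalent_inputs, complementary_inputs):
    """Toy black-box uncertainty score (hypothetical, FESTA-inspired).

    model: a callable returning a discrete prediction for an input;
    equivalent_inputs: task-preserving variants of the original input;
    complementary_inputs: task-altering variants that should flip the answer.
    """
    base = model(equivalent_inputs[0])
    # Consistency: fraction of equivalent samples that preserve the prediction.
    consistency = sum(model(x) == base for x in equivalent_inputs) / len(equivalent_inputs)
    # Sensitivity: fraction of complementary samples that change the prediction.
    sensitivity = sum(model(x) != base for x in complementary_inputs) / len(complementary_inputs)
    # Low consistency or low sensitivity -> high uncertainty (one possible combination).
    return 1.0 - consistency * sensitivity

def auroc(scores, labels):
    """AUROC via pairwise ranking: P(positive score > negative score),
    with ties counted as 0.5. Labels: 1 = misprediction, 0 = correct."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, a sign-classifier toy model `lambda x: x > 0` probed with equivalent inputs `[1, 2, 3]` and complementary inputs `[-1, -2]` yields an uncertainty of 0.0, since it is fully consistent and fully sensitive; a model that ignores its input would instead score high uncertainty through zero sensitivity.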