Accurately assessing the trustworthiness of predictions generated by multimodal large language models (MLLMs), which can enable selective prediction and improve user confidence, is challenging due to the diverse multimodal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs that generates an uncertainty measure based on equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access to the model (black-box) and does not require ground truth (unsupervised). Experiments are conducted with various off-the-shelf multimodal LLMs on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves a significant improvement in selective prediction performance (33.3% relative improvement for vision-LLMs and 29.6% for audio-LLMs), measured by the area under the receiver operating characteristic curve (AUROC) for detecting mispredictions. The code implementation is open-sourced.
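The consistency/sensitivity idea can be illustrated with a minimal black-box sketch. This is a hypothetical simplification, not the paper's actual estimator: `model` is any callable giving input-output access, and the equivalent/complementary sample sets are assumed to be provided by a task-preserving sampler.

```python
def festa_style_uncertainty(model, original, equivalents, complements):
    """Toy black-box uncertainty sketch (hypothetical, not FESTA's exact formula).

    model:       callable mapping an input to a discrete answer (black-box access)
    equivalents: task-preserving variants; a trustworthy model keeps its answer
    complements: task-altering variants; a trustworthy model changes its answer
    Returns a score in [0, 1]; 0 means fully consistent and fully sensitive.
    """
    base = model(original)
    # Consistency: agreement rate of the base answer across equivalent samples
    consistency = sum(model(x) == base for x in equivalents) / len(equivalents)
    # Sensitivity: rate at which complementary samples flip the answer
    sensitivity = sum(model(x) != base for x in complements) / len(complements)
    # Low consistency or low sensitivity both signal an untrustworthy prediction
    return 1.0 - (consistency * sensitivity)
```

A model that answers identically on all equivalent samples and flips on all complementary samples gets uncertainty 0; erratic answers on either set raise the score, which is what makes such a measure usable for selective prediction (abstaining on high-uncertainty inputs).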