Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.