Accurate uncertainty quantification is crucial for reliable decision-making in supervised learning, particularly with complex, multimodal data such as images and text. Current approaches often rest on rigid assumptions and generalize poorly, which limits their effectiveness across diverse supervised learning tasks. To overcome these limitations, we introduce Generative Score Inference (GSI), a flexible inference framework that can construct statistically valid and informative prediction and confidence sets across a wide range of multimodal learning problems. GSI uses synthetic samples generated by deep generative models to approximate conditional score distributions, enabling precise uncertainty quantification without restrictive assumptions about the data or tasks. We empirically validate GSI in two representative scenarios: hallucination detection in large language models and uncertainty estimation in image captioning. Our method achieves state-of-the-art performance in hallucination detection and robust predictive uncertainty in image captioning, and its performance improves with the quality of the underlying generative model. These findings underscore the potential of GSI as a versatile inference framework for improving uncertainty quantification and trustworthiness in multimodal learning.
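As a rough illustration of approximating a conditional score distribution with synthetic samples, the following minimal sketch may help. It is not the authors' algorithm: the names `synthetic_score_quantile`, `toy_sampler`, and `toy_score` are hypothetical, the "generative model" is a toy Gaussian sampler standing in for a deep generative model, and the score is a simple absolute residual chosen only for illustration.

```python
import numpy as np


def synthetic_score_quantile(x, sample_y_given_x, score, alpha=0.1,
                             n_synth=2000, rng=None):
    """Estimate the (1 - alpha) quantile of score(x, Y), with Y drawn from
    a generative model of Y given X = x (all names here are hypothetical)."""
    rng = np.random.default_rng() if rng is None else rng
    # Synthetic draws that approximate the conditional distribution of Y | X = x.
    y_synth = sample_y_given_x(x, n_synth, rng)
    # Empirical sample of the conditional score distribution.
    scores = score(x, y_synth)
    return float(np.quantile(scores, 1.0 - alpha))


def toy_sampler(x, n, rng):
    # Assumption: a toy stand-in for a deep generative model, Y | X = x ~ N(2x, 1).
    return rng.normal(loc=2.0 * x, scale=1.0, size=n)


def toy_score(x, y):
    # Assumption: absolute residual around a known point prediction 2x.
    return np.abs(y - 2.0 * x)


rng = np.random.default_rng(0)
x0 = 1.5
q = synthetic_score_quantile(x0, toy_sampler, toy_score, alpha=0.1, rng=rng)

# Prediction set at x0: all y whose score does not exceed the estimated quantile.
# Check its coverage on fresh draws from the same toy conditional distribution.
y_new = toy_sampler(x0, 5000, rng)
coverage = float(np.mean(toy_score(x0, y_new) <= q))
```

In this toy setting the sampler matches the true conditional distribution, so the empirical coverage should land near the nominal level 1 - alpha. With a learned generative model, how close the coverage gets would depend on how well the model approximates the true conditional distribution, which is consistent with the abstract's remark that performance depends on generative model quality.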