Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and hallucination-like outputs, particularly when the visual evidence is weak, ambiguous, or semantically inconsistent. Most existing approaches focus on improving multimodal representation alignment or retrieval-augmented generation, and offer only limited mechanisms for quantifying instance-level prediction reliability or identifying incorrect visual outputs. This work proposes a retrieval-augmented, reliability-aware inference framework for trustworthy multimodal visual understanding. The framework constructs an external visual evidence database from pretrained visual embeddings and performs nearest-neighbor retrieval over normalized feature representations. The retrieved evidence is used to estimate prediction trustworthiness through several reliability indicators: similarity strength, class-support agreement, evidence margin, entropy-based uncertainty, and an aggregate reliability score. Based on these signals, a decision gate determines whether the system should accept the prediction, answer with caution, or abstain and fall back when evidence is insufficient. A multimodal response-generation layer then produces the final user-facing response conditioned on the reliability decision. Experiments on ImageNet-100 show that the framework raises accuracy on accepted predictions from 85.84% to 88.88% at 89.04% coverage, and correspondingly lowers the rate of hallucination-like accepted wrong answers from 14.16% to 11.12%. These results indicate that integrating retrieval evidence, reliability estimation, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models.
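To make the pipeline described in the abstract concrete, the following is a minimal sketch of how the reliability indicators and the three-way decision gate might be computed from retrieved neighbors. It is not the authors' implementation: the function names `reliability_signals` and `decision_gate`, the equal-weight aggregation, the value of `k`, and the thresholds `accept_at` and `caution_at` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def reliability_signals(query, bank, labels, k=5, num_classes=None):
    """Retrieve the k nearest neighbours of `query` in `bank` and derive
    illustrative reliability signals. All formulas here are assumptions."""
    # L2-normalise so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    top = np.argsort(-sims)[:k]                 # indices of the k most similar items
    top_sims, top_labels = sims[top], labels[top]

    # Requires at least two classes (for the margin and the entropy normaliser).
    n = num_classes or int(labels.max()) + 1
    votes = np.bincount(top_labels, minlength=n) / len(top)  # class-support distribution
    order = np.argsort(-votes)
    pred = int(order[0])

    strength = float(np.clip(top_sims.mean(), 0.0, 1.0))     # similarity strength
    agreement = float(votes[pred])                            # class-support agreement
    margin = float(votes[order[0]] - votes[order[1]])         # evidence margin
    nz = votes[votes > 0]
    entropy = float(-(nz * np.log(nz)).sum() / np.log(n))     # normalised to [0, 1]

    # Assumed aggregate: unweighted mean of the four signals (entropy inverted).
    score = (strength + agreement + margin + (1.0 - entropy)) / 4.0
    return {"pred": pred, "strength": strength, "agreement": agreement,
            "margin": margin, "entropy": entropy, "score": score}

def decision_gate(score, accept_at=0.7, caution_at=0.4):
    """Map an aggregate reliability score to one of three actions.
    The thresholds are hypothetical and would need tuning on held-out data."""
    if score >= accept_at:
        return "accept"
    if score >= caution_at:
        return "caution"
    return "abstain"

# Toy evidence bank: three items per class in a 2-D embedding space.
bank = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                 [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]])
labels = np.array([0, 0, 0, 1, 1, 1])

clear = reliability_signals(np.array([1.0, 0.02]), bank, labels, k=3)
ambiguous = reliability_signals(np.array([1.0, 1.0]), bank, labels, k=6)
print(clear["pred"], decision_gate(clear["score"]))
print(decision_gate(ambiguous["score"]))
```

In this toy setup the first query sits close to the class-0 cluster, so all signals agree and the gate accepts; the second query is equidistant from both clusters, the neighbor votes split evenly, and the gate abstains. A practical system would choose the thresholds to trade coverage against accepted-prediction accuracy.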