Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences: minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework that reduces hallucinations by solving the Schrödinger bridge problem. Through lightweight training, it establishes a token-level mapping between hallucinatory and truthful activations at minimal transport cost, while preserving the model's original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.