When Large Vision Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant hallucination problems. This severely impairs generative accuracy and makes it difficult to deploy LVLMs in real-world medical scenarios to assist doctors with diagnosis. Enhancing the training data for downstream medical generative tasks is an effective way to address model hallucination. However, the limited availability of training data in the medical field and privacy concerns greatly hinder the model's accuracy and generalization capabilities. In this paper, we introduce a method that mimics human cognitive processes to construct fine-grained instruction pairs and transfers the concept of chain-of-thought (CoT) from inference to training, yielding a method we call MedThink. Our experiments on several LVLMs demonstrate that this data construction method, tailored to the medical domain, significantly improves model performance on medical image report generation and substantially mitigates hallucinations. All resources of this work will be released soon.
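To make the idea of CoT-style fine-grained instruction pairs concrete, here is a minimal sketch of how a (image, report) pair might be rewrapped so that the training response exposes intermediate reasoning before the final conclusion. The function name, field names, and the findings/impression decomposition are illustrative assumptions, not the authors' actual pipeline.

```python
# Hedged sketch: turn a (image, report) pair into a CoT-style instruction
# pair whose response gives intermediate steps (findings) before the final
# answer (impression). All names here are hypothetical, not MedThink's API.

def build_cot_instruction_pair(image_id, findings, impression):
    """Wrap a radiology report into an instruction/response pair whose
    response lists intermediate observations before the conclusion."""
    instruction = (
        "Describe the medical image step by step, listing the findings "
        "before giving the overall impression."
    )
    # Chain-of-thought-style response: reasoning steps first, answer last.
    response = "Findings: " + " ".join(findings) + " Impression: " + impression
    return {"image": image_id, "instruction": instruction, "response": response}

pair = build_cot_instruction_pair(
    "cxr_0001.png",
    ["The lungs are clear.", "No pleural effusion is seen."],
    "No acute cardiopulmonary abnormality.",
)
print(pair["response"])
```

A downstream fine-tuning loop would then train the LVLM on `response` conditioned on the image and `instruction`, so the model learns to produce the intermediate findings rather than jumping straight to the impression.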