When Large Vision-Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant hallucination issues. This severely impairs generative accuracy, making it challenging to deploy LVLMs in real-world medical scenarios to assist doctors in diagnosis. Enhancing the training data for downstream medical generative tasks is an effective way to address model hallucination. However, the limited availability of training data in the medical field and privacy concerns greatly hinder the model's accuracy and generalization capabilities. In this paper, we introduce a method that mimics human cognitive processes to construct fine-grained instruction pairs and extends the concept of chain-of-thought (CoT) from inference scenarios to training scenarios, yielding a method we call MedThink. Experiments on various LVLMs demonstrate that this novel data construction method, tailored for the medical domain, significantly improves model performance on medical image report generation tasks and substantially mitigates hallucinations. All resources of this work will be released soon.
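The idea of moving CoT from inference-time prompting into the training data itself can be illustrated with a toy data-construction routine. This is a minimal sketch under our own assumptions: the field names, the three-step observe/describe/conclude decomposition, and the example report are all illustrative, not the paper's actual instruction format.

```python
# Hypothetical sketch: wrapping a radiology report into a CoT-style
# instruction pair so the reasoning steps become part of the training
# target, rather than being elicited only at inference time.

def build_cot_instruction_pair(image_id: str, findings: str, impression: str) -> dict:
    """Return one fine-grained instruction pair with an explicit reasoning chain."""
    reasoning_steps = [
        f"Step 1 - Observe: examine the salient regions of image {image_id}.",
        f"Step 2 - Describe: {findings}",
        f"Step 3 - Conclude: {impression}",
    ]
    return {
        "instruction": (
            "Describe the medical image and reason step by step "
            "before giving the final report."
        ),
        "input": image_id,
        "output": "\n".join(reasoning_steps),  # CoT baked into the label
    }

pair = build_cot_instruction_pair(
    "cxr_0001",
    "The cardiac silhouette is enlarged; no focal consolidation is seen.",
    "Cardiomegaly without acute pulmonary disease.",
)
print(pair["output"])
```

A model fine-tuned on such pairs is supervised on the intermediate observations as well as the final impression, which is the sense in which CoT is applied to the training scenario rather than only to inference.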