Interpretability is critical in computational pathology, motivating the integration of multimodal information from histopathological images and their corresponding text data. However, existing multimodal methods offer limited interpretability, owing both to the lack of high-quality datasets that support explicit reasoning and inference, and to overly simple reasoning processes. To address these problems, we introduce a novel multimodal pathology large language model with strong reasoning capabilities. To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization (GRPO). We construct a high-quality pathology visual question answering (VQA) dataset specifically designed to support complex reasoning tasks. Comprehensive experiments on this dataset demonstrate that our method outperforms state-of-the-art methods even when trained with only 20% of the data. Our method also achieves performance comparable to CLIP on the downstream zero-shot image classification task.
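The combination of a semantic reward with group relative policy optimization can be illustrated by a minimal sketch. This is an assumption-laden toy, not the paper's exact formulation: the semantic reward is modeled here as cosine similarity between embeddings of a generated answer and a reference answer, and GRPO's group-relative advantage is the reward standardized within each sampled group of responses.

```python
# Toy sketch (assumed, not the paper's implementation): a semantic
# reward scored by embedding cosine similarity, normalized into
# GRPO-style group-relative advantages.
import numpy as np

def semantic_reward(gen_emb, ref_emb):
    """Cosine similarity between generated- and reference-text embeddings."""
    num = float(np.dot(gen_emb, ref_emb))
    den = float(np.linalg.norm(gen_emb) * np.linalg.norm(ref_emb)) + 1e-8
    return num / den

def group_relative_advantages(rewards):
    """GRPO standardizes each reward within its sampled response group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy example: 4 sampled answers scored against one reference embedding
# (random vectors stand in for a real text-embedding model).
rng = np.random.default_rng(0)
ref = rng.normal(size=8)
group = [rng.normal(size=8) for _ in range(4)]
rewards = [semantic_reward(g, ref) for g in group]
adv = group_relative_advantages(rewards)
print(np.round(adv, 3))
```

The resulting advantages are zero-mean within the group, so each sampled answer is reinforced only relative to its peers rather than against an absolute reward scale.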