Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs spanning multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA) and 54.15% on SlideBench-VQA (BCNB). We will fully release SlideChat, SlideInstruction, and SlideBench as open-source resources to facilitate research and development in computational pathology.