Whole Slide Image (WSI) classification is often formulated as a Multiple Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have demonstrated remarkable performance in WSI classification. However, existing methods supervise visual representations with coarse-grained pathological descriptions, which are insufficient to capture the complex visual appearance of pathological images, hindering model generalizability on diverse downstream tasks. Additionally, processing high-resolution WSIs can be computationally expensive. In this paper, we propose a novel "Fine-grained Visual-Semantic Interaction" (FiVE) framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics. Specifically, using meticulously designed queries, we first employ a large language model to extract fine-grained pathological descriptions from non-standardized raw reports; the resulting descriptions are then reconstructed into fine-grained labels for training. By introducing a Task-specific Fine-grained Semantics (TFS) module, we enable prompts to capture crucial visual information in WSIs, which enhances representation learning and significantly improves generalization. Furthermore, because pathological visual patterns are redundantly distributed across tissue slices, we sample only a subset of visual instances during training. Our method demonstrates robust generalizability and strong transferability, outperforming its counterparts on the TCGA Lung Cancer dataset by at least 9.19% accuracy in few-shot experiments. The code is available at: https://github.com/ls1rius/WSI_FiVE.