Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

Whole Slide Image (WSI) classification is often formulated as a Multiple Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have demonstrated remarkable performance in WSI classification. However, existing methods supervise visual representations with coarse-grained pathological descriptions, which are insufficient to capture the complex visual appearance of pathological images and therefore hinder the generalizability of models across diverse downstream tasks. Additionally, processing high-resolution WSIs is computationally expensive. In this paper, we propose a novel "Fine-grained Visual-Semantic Interaction" (FiVE) framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interplay between localized visual patterns and fine-grained pathological semantics. Specifically, using carefully designed queries, we first use a large language model to extract fine-grained pathological descriptions from various non-standardized raw reports. The output descriptions are then reconstructed into fine-grained labels used for training. By introducing a Task-specific Fine-grained Semantics (TFS) module, we enable prompts to capture crucial visual information in WSIs, which enhances representation learning and significantly improves generalization. Furthermore, because pathological visual patterns are redundantly distributed across tissue slices, we sample only a subset of visual instances during training to reduce the computational cost. Our method demonstrates robust generalizability and strong transferability, outperforming competing methods on the TCGA Lung Cancer dataset by at least 9.19% in accuracy in few-shot experiments.
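
The instance-sampling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the subset is chosen, so uniform sampling without replacement is assumed here. The function name `sample_instances`, the budget `k`, and the toy feature vectors are all hypothetical and are not taken from the FiVE implementation.

```python
import random
from typing import List, Optional, Sequence


def sample_instances(
    bag: Sequence[List[float]], k: int, seed: Optional[int] = None
) -> List[List[float]]:
    """Return at most k instances drawn uniformly without replacement from one bag.

    Assumption: a "bag" is the list of patch feature vectors of a single WSI.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(bag) <= k:
        # Small bags are kept whole; there is nothing to subsample.
        return list(bag)
    rng = random.Random(seed)
    # Sort the sampled indices so the kept instances preserve their original order.
    indices = sorted(rng.sample(range(len(bag)), k))
    return [bag[i] for i in indices]


# Toy bag: 1000 hypothetical 2-dimensional patch features.
bag = [[float(i), float(i) * 0.5] for i in range(1000)]
subset = sample_instances(bag, k=64, seed=0)
print(len(subset))  # 64
```

Drawing a fresh subset at every training step (by varying `seed` or passing `None`) would let the model see different patches of the same slide over time while keeping the per-step cost bounded by `k`.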
