Enhancing Pathological VLMs with Cross-scale Reasoning

Pathological images are inherently multi-scale: to reach an accurate diagnosis, pathologists integrate evidence ranging from global tissue architecture at low magnification to cellular morphology at high magnification. While existing pathology datasets for vision-language models (VLMs) include images at various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and from learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. Creating such a task, however, reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers from magnification-dependent artifacts in the text rather than from visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark of 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to established single-scale benchmarks, where it also reaches state-of-the-art performance. Our findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.
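
To make the idea of adversarial text-only screening concrete, the following is a minimal sketch under our own assumptions, not the pipeline used to build Scale-VQA. A question is treated as leaky when guessers that never see the images recover the correct option too often. The names `MCQ`, `leaks_via_text`, `screen`, and the `max_hit_rate` threshold are hypothetical, and the two toy heuristics stand in for what would in practice be text-only language models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    """A multiple-choice question; images are deliberately omitted (hypothetical type)."""
    question: str
    options: tuple[str, ...]
    answer: int  # index of the correct option


# A text-only guesser sees the question and options but no image.
TextOnlyGuesser = Callable[[MCQ], int]


def longest_option(item: MCQ) -> int:
    """Toy shortcut: pick the longest option (first one on ties)."""
    return max(range(len(item.options)), key=lambda i: len(item.options[i]))


def keyword_overlap(item: MCQ) -> int:
    """Toy shortcut: pick the option sharing the most words with the question."""
    q_words = set(item.question.lower().split())
    return max(
        range(len(item.options)),
        key=lambda i: len(q_words & set(item.options[i].lower().split())),
    )


def leaks_via_text(item: MCQ, guessers: Sequence[TextOnlyGuesser],
                   max_hit_rate: float = 0.5) -> bool:
    """Flag an item if more than max_hit_rate of guessers answer it without images."""
    if not guessers:
        raise ValueError("at least one text-only guesser is required")
    hits = sum(g(item) == item.answer for g in guessers)
    return hits / len(guessers) > max_hit_rate


def screen(items: Sequence[MCQ], guessers: Sequence[TextOnlyGuesser],
           max_hit_rate: float = 0.5) -> list[MCQ]:
    """Keep only items that the text-only guessers fail to solve."""
    return [it for it in items if not leaks_via_text(it, guessers, max_hit_rate)]


# Illustrative items (invented for this sketch, not drawn from Scale-VQA).
leaky = MCQ(
    "At 40x magnification, which cellular feature is seen?",
    ("Glands", "Cellular atypia with prominent nucleoli", "Fibrosis", "Necrosis"),
    1,
)
clean = MCQ(
    "Which finding is supported by both views?",
    ("Invasive growth", "Benign hyperplasia", "Chronic inflammation", "Normal mucosa"),
    3,
)

kept = screen([leaky, clean], [longest_option, keyword_overlap])
```

In this toy run, the first item is discarded because its wording hints at a cell-level answer, while the second survives because neither heuristic can solve it from text alone. Items flagged this way could then be rewritten under the question-design constraints rather than simply dropped.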
