Pathology reports are structured, multi-granular documents that encode diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites, yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation, with only a 3.4 percentage-point drop in strict accuracy while maintaining a 97.1% safety rate.