Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos

The gigapixel scale of whole slide images (WSIs) poses a challenge for histopathology multi-modal chatbots: diagnosis requires a global analysis of the WSI that compounds evidence from different patches. Current visual instruction datasets, generated through large language models, focus on creating question/answer pairs for individual image patches, which may lack diagnostic capacity on their own in histopathology; the problem is further complicated by the absence of spatial grounding in histopathology image captions. To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs collected from educational histopathology videos on YouTube, in which captions are spatially localized by automatically extracting the narrators' cursor movements. In addition, we provide contextual reasoning by extracting the diagnosis and supporting facts from the entire video content to guide the extrapolative reasoning of GPT-4. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch, enabling diagnostic reasoning and spatial awareness. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1,283 human-generated question/answer pairs. We also thoroughly evaluate Quilt-LLaVA on public histopathology datasets, where it significantly outperforms the state of the art (SOTA) by over 10% on relative GPT-4 score and by 4% and 9% on open-set and closed-set VQA, respectively. Our code, data, and model are publicly available at quilt-llava.github.io.
