PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature

Source: arXiv. 12 pages, 4 figures, 3 supplementary figures. Dataset available at https://huggingface.co/datasets/pubmed-ophtha/PubMed-Ophtha. Code available at https://github.com/berenslab/pubmed-ophtha

Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mean average precision (mAP) of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
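The abstract reports figure extraction quality as a median intersection-over-union (IoU). As a minimal sketch of how such a number can be computed, assuming that extracted and ground-truth figures are both represented as axis-aligned boxes `(x_min, y_min, x_max, y_max)`, the following Python snippet may help. The function names and the example boxes are hypothetical illustrations, not taken from the authors' pipeline.

```python
from statistics import median


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def median_iou(pairs):
    """Median IoU over (predicted_box, ground_truth_box) pairs."""
    return median(box_iou(pred, truth) for pred, truth in pairs)


if __name__ == "__main__":
    # Hypothetical boxes, for illustration only.
    example_pairs = [
        ((0, 0, 10, 10), (0, 0, 10, 10)),   # perfect match
        ((0, 0, 10, 10), (5, 0, 15, 10)),   # partial overlap
        ((0, 0, 10, 10), (20, 20, 30, 30)), # no overlap
    ]
    print(median_iou(example_pairs))
```

A median close to 1, such as the reported 0.997, would indicate that at least half of the extracted figure regions match their annotated counterparts almost exactly; the median is less sensitive to a few failed extractions than the mean.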

